Statistical computing with r solutions pdf download

Several examples using the graphics functions in Table 1. See Table 4. Available plotting characters are shown in the manual [, Ch. The following produces a display of colors. For example, plot. Use colors() to see the vector of named colors. Where color palettes would normally be used, we have substituted a grayscale palette.

A table of plotting characters is produced by show. A utility to display available colors in R is show. Also see show. Many introductory and more advanced texts can be recommended for review and reference. On introductory probability see e.g. Bean [23], Ghahramani [], or Ross [].

Casella and Berger [39] or Bain and Engelhardt [16] are somewhat more advanced. Durrett [77] is a graduate probability text. Lehmann [] and Lehmann and Casella [] are graduate texts in statistical inference. We will omit the subscript X and write F(x) if it is clear in context. The cdf has the following properties: 1. F_X is non-decreasing. A random variable X is discrete if F_X is a step function. In the remainder of this chapter, for simplicity f_X(x) denotes either the pdf (if X is continuous) or the pmf (if X is discrete) of X.

Expectation, Variance, and Moments

The mean of a random variable X is the expected value or mathematical expectation of the variable, denoted E[X]. The rth moment of X is E[X^r].

The square root of the variance is the standard deviation. The reciprocal of the variance is the precision. The random variables X1 ,. That is, X1 ,. However, the converse is not true; uncorrelated variables are not necessarily independent. Then the following properties hold provided the moments exist. Apply properties 2, 7, and 5 above. Ross [, Ch. Three important counting distributions are the binomial (and Bernoulli), negative binomial (and geometric), and Poisson. Several discrete distributions including the binomial, geometric, and negative binomial distributions can be formulated in terms of the outcomes of Bernoulli trials.

A sequence of Bernoulli trials is a sequence of outcomes X1 , X2 ,.

Binomial and Multinomial Distribution

Suppose that X records the number of successes in n iid Bernoulli trials with success probability p. The mean and variance formulas are easily derived by observing that the binomial variable is an iid sum of n Bernoulli(p) variables.

A binomial distribution is a special case of a multinomial distribution. Let X_j record the number of times that event A_j occurs in n independent and identical trials of the experiment.

Geometric Distribution

Consider a sequence of Bernoulli trials, with success probability p.

Negative Binomial Distribution

The negative binomial frequency model applies in the same setting as a geometric model, except that the variable of interest is the number of failures until the rth success. Suppose that exactly X failures occur before the rth success.

Note that 2. If X has pmf 2. Then X is the iid sum of r Geom(p) variables. Note that like the geometric random variable, there is an alternative formulation of the negative binomial model that counts the number of trials until the rth success.

The Poisson distribution has many important properties and applications see e. Examples Example 2. The last equality follows because the summand is the Poisson pmf and the total probability must sum to 1.

We summarize some of these properties, without proof. For more properties and characterizations see [, Ch. A linear transformation of a normal variable is also normally distributed. Therefore, if X1 ,. In case the sampled distribution is not normal, but the sample size is large, the Central Limit Theorem implies that the distribution of Y is approximately normal.

See Section 2. If X1 ,. If Z1 ,. In Bayesian analysis, a beta distribution is often chosen to model the distribution of a probability parameter, such as the probability of success in Bernoulli trials or a binomial experiment.

Denote the distribution with density function 2. Some properties of the bivariate normal distribution 2. Then the bivariate normal pdf 2. The multivariate normal distribution The joint distribution of continuous random variables X1 ,. In fact, all of the marginal distributions of a multivariate normal vector are multivariate normal see e.

Tong [, Sec. The normal random variables X1 ,. Linear transformations of multivariate normal random vectors are multivariate normal. Applications and properties of the multivariate normal distribution are covered by Anderson [8] and Mardia et al.

Refer to Tong [] for properties and characterizations of the bivariate normal and multivariate normal distribution. Probability and Statistics Review 2. Suppose that X1 , X2. Durrett [77]. Suppose that X1 , X2 ,. See Durrett [77] for the proofs. Lowercase letters x1 ,. Some examples of statistics are the sample mean, sample variance, etc. This estimate is called the empirical cumulative distribution function (ecdf) or empirical distribution function (edf).

The ecdf of an observed sample x1 ,. A quantile of a distribution is found by inverting the cdf. R note 2. When quantiles are estimated, the density f is usually unknown, but 2.
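As a small illustration of the empirical distribution function and sample quantiles, the following sketch uses the base R functions ecdf and quantile on a simulated sample. The sample, its size, and the variable names are assumptions made for this illustration only.

```r
set.seed(1)
x <- rnorm(100)                  # stand-in for an observed sample x1, ..., xn

Fn <- ecdf(x)                    # the ecdf is returned as a step function
Fn(0)                            # proportion of sample values <= 0

# Sample quantiles estimate the inverse of the cdf at the given probabilities
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q
```

Because Fn is itself a function, it can be evaluated at any point or passed to plot to draw the step function.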

For many problems, the MLE can be determined analytically. However, it is often the case that the optimization cannot be solved analytically, and in that case numerical optimization or other computational approaches can be applied.
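To make the numerical route concrete, here is a minimal sketch that maximizes a log-likelihood with the base R function optimize. The exponential model, the search interval, and all variable names are assumptions chosen for illustration; this model has a closed-form MLE, which makes the numerical answer easy to check.

```r
set.seed(2)
x <- rexp(200, rate = 2)         # simulated data; the true rate is an assumption

# Log-likelihood of an Exponential(lambda) sample
loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))

# One-dimensional numerical maximization over an assumed search interval
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)

mle_numeric <- fit$maximum
mle_exact   <- 1 / mean(x)       # analytical MLE for the exponential rate
c(numeric = mle_numeric, exact = mle_exact)
```

For models with several parameters, optim plays the same role as optimize does here.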

Maximum likelihood estimators have an invariance property. Note that the maximum likelihood principle can also be applied in problems where the observed variables are not independent or identically distributed the likelihood function 2. Example 2. Suppose that x1 ,. From Example 2. The Bayesian approach views the unknown parameters of a distribution as random variables.

Thus, in Bayesian analysis, probabilities can be computed for parameters as well as the sample statistics. Then one is interested in computing posterior quantities such as posterior means, posterior modes, posterior standard deviations, etc. Note that any constant in the likelihood function cancels out of the posterior density. However, Monte Carlo methods are available that do not require the evaluation of the constant in order to sample from the posterior distribution and estimate posterior quantities of interest.

Readers are referred to Lee [] for an introductory presentation of Bayesian statistics. Albert [5] is a good introduction to computational Bayesian methods with R. A textbook covering probability and mathematical statistics from both a classical and Bayesian perspective at an advanced undergraduate level is DeGroot and Schervish [64].

Readers are referred to Ross [, Ch. Our goal is to generate a chain by simulation, so we consider discrete time Markov chains. The time index will be the nonnegative integers, so that the process starts in state X0 and makes successive transitions to X1 , X2 ,. The set of possible values of Xt is the state space. Without loss of generality, we can suppose that the states are 0, 1, 2,. In other words, the transition probability depends only on the current state, and not on the past.

The probability that the chain moves from state i to state j in k steps is p_ij^(k), and the Chapman-Kolmogorov equations see e. A state i is recurrent if the chain returns to i with probability 1; otherwise state i is transient. The period of a state i is the greatest common divisor of the lengths of paths starting and ending at i.

In an irreducible chain, the periods of all states are equal, and the chain is aperiodic if the states all have period 1. Positive recurrent, aperiodic states are ergodic. A DNA nucleotide has four possible values. If it does change, then it is equally likely to change to any of the other three values.

Thus each row must sum to 1 (the matrix is row stochastic). This matrix happens to be doubly stochastic because the columns also sum to 1, but in general a transition matrix need only be row stochastic.
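The following sketch builds a row-stochastic transition matrix for a four-state chain of the kind described above and computes a two-step transition matrix and the stationary distribution. The specific probabilities (0.7 to stay, 0.1 for each change) are assumptions for illustration and are not taken from the text.

```r
# Hypothetical 4-state chain: stay with probability 0.7,
# otherwise move to each of the other three states with probability 0.1
P <- matrix(0.1, nrow = 4, ncol = 4)
diag(P) <- 0.7

rowSums(P)                      # each row sums to 1 (row stochastic)
colSums(P)                      # here the columns also sum to 1

# Two-step transition probabilities are the entries of P %*% P
P2 <- P %*% P

# Stationary distribution: left eigenvector of P for eigenvalue 1,
# i.e. an eigenvector of t(P), normalized to sum to 1
e <- eigen(t(P))
v <- Re(e$vectors[, 1])         # eigenvalues are sorted, so the first is 1
pi_hat <- v / sum(v)
pi_hat
```

For a doubly stochastic matrix such as this one the stationary distribution is uniform over the states.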

All entries of P are positive, hence all states communicate; the chain is irreducible and ergodic. The stationary distribution is the solution of equations 2. The state of the process at time n is the current location of the walker at time n. In the random walk model all states communicate, so the chain is irreducible. All states have period 2. For example, it is impossible to return to state 0 starting from 0 in an odd number of steps.

The symmetric random walk is discussed in Example 3. On this topic many excellent references are available. Therefore a suitable generator of uniform pseudo random numbers is essential.

Methods for generating random variates from other probability distributions all depend on the uniform random number generator. In this text we assume that a suitable uniform pseudo random number generator is available.

Refer to the help topic for. The uniform pseudo random number generator in R is runif. To generate a vector of n pseudo random numbers between 0 and 1 use runif(n). Throughout this text, whenever computer generated random numbers are mentioned, it is understood that these are pseudo random numbers. To generate n random Uniform(a, b) numbers use runif(n, a, b). In the examples of this chapter, several functions are given for generating random variates from continuous and discrete probability distributions.

Generators for many of these distributions are available in R e. These methods are also applicable for external libraries, stand alone programs, or nonstandard simulation problems.

In some examples, histograms, density curves, or QQ plots are constructed. In other examples summary statistics such as sample moments, sample percentiles, or the empirical distribution are compared with the corresponding theoretical values.

These are informal approaches to check the implementation of an algorithm for simulating a random variable. Example 3. Before discussing those methods, however, it is useful to summarize some of the probability functions available in R. The probability mass function (pmf) or density (pdf), cumulative distribution function (cdf), quantile function, and random generator of many commonly used probability distributions are available.

A partial list of available probability distributions and parameters is given in Table 3. For a complete list, refer to the R documentation [, Ch. In addition to the parameters listed, some of the functions take optional log, lower. TABLE 3. The inverse transform method of generating random variables applies the probability integral transformation. The method is easy to apply, provided that FX is easy to compute. The method can be applied for generating continuous or discrete random variables.

The method can be summarized as follows. Derive the inverse function F_X^{-1}(u). Write a command or function to compute F_X^{-1}(u). For each random variate required: (a) Generate a random u from Uniform(0,1). Generate all n required random uniform numbers as vector u. Math annotation is covered in the help topic for plotmath.
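The steps of the inverse transform method can be sketched for a continuous distribution as follows. The target density f(x) = 3x^2 on (0, 1) is an assumption chosen for illustration; its cdf is F(x) = x^3, so the inverse is u^(1/3).

```r
set.seed(4)
n <- 10000

# Inverse cdf for the assumed target density f(x) = 3 x^2 on (0, 1)
F_inv <- function(u) u^(1/3)

u <- runif(n)                   # all n uniforms generated as one vector
x <- F_inv(u)                   # vectorized transform, no loop needed

# Informal check: the theoretical mean is 3/4
mean(x)

# Density histogram with the target density superimposed
hist(x, prob = TRUE, main = "Inverse transform sample")
curve(3 * x^2, from = 0, to = 1, add = TRUE)
```

Generating the whole vector u at once and transforming it in a single call is the usual vectorized form of the per-variate steps.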

Also see the help topics for text and axis. However, this algorithm is very useful for implementation in other situations, such as a C program. If X is a discrete random variable and. For each random variate required: 1. Generate a random u from Uniform(0,1).
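For a discrete random variable, the inverse transform step amounts to locating u among the cumulative probabilities. The sketch below does this with findInterval; the support points and probabilities are hypothetical values chosen for illustration.

```r
set.seed(5)
x_vals <- c(0, 1, 2, 3)             # hypothetical support points
p      <- c(0.1, 0.2, 0.3, 0.4)     # hypothetical probabilities (sum to 1)

cp <- cumsum(p)                     # cdf evaluated at the support points
cp[length(cp)] <- 1                 # guard against floating point round-off

u <- runif(10000)
# findInterval(u, cp) counts how many cdf values are <= u,
# so adding 1 gives the index of the smallest x with F(x) > u
x <- x_vals[findInterval(u, cp) + 1]

# Compare empirical and theoretical probabilities
freq <- as.vector(table(factor(x, levels = x_vals))) / length(x)
rbind(empirical = freq, theoretical = p)
```

The same search can be written as an explicit loop in a language such as C, which is where this formulation is most useful.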

See Devroye [69, Ch. Although there are simpler methods to generate a two point distribution in R, this example illustrates computing the inverse cdf of a discrete random variable in the simplest case. Our sample statistics are. Methods for Generating Random Variables Example 3. The R function rpois generates random Poisson samples. To illustrate the main idea of the inverse transform method for generating Poisson variates, here is a similar example for which there is no R generator available: the logarithmic distribution.

The logarithmic distribution is a one parameter discrete distribution supported on the positive integers. A random variable X has the logarithmic distribution see [], Ch. Instead we compute the pmf from 3. In generating a large sample, there will be many repetitive calculations of the same values F(x). If necessary, N will be increased. The code for logarithmic is on the next page.

Generate random samples from a Logarithmic 0. Then the acceptance-rejection method or rejection method can be applied to generate the random variable X. The Acceptance-Rejection Method 1. Provide a method to generate random Y. For each random variate required: (a) Generate a random y from the distribution with density g. Hence, on average each sample value of X requires c iterations. Let g(x) be the Uniform(0,1) density. In the following simulation, the counter j for iterations is not necessary, but included to record how many iterations were actually needed to generate the beta variates.
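A minimal sketch of the acceptance-rejection method with a Uniform(0,1) proposal is given below. The target Beta(2,2) density f(x) = 6x(1-x) and the bound c = 1.5 (the maximum of f) are assumptions for this illustration; the counter j records the number of iterations, as described above.

```r
set.seed(6)
n  <- 1000
k  <- 0                         # number of accepted variates so far
j  <- 0                         # number of iterations (candidates tried)
y  <- numeric(n)
cc <- 1.5                       # assumed bound: f(x) <= cc * g(x) on (0, 1)
f  <- function(x) 6 * x * (1 - x)   # assumed target: Beta(2,2) density

while (k < n) {
  x <- runif(1)                 # candidate from g = Uniform(0,1)
  u <- runif(1)
  j <- j + 1
  if (u < f(x) / cc) {          # accept with probability f(x) / (cc * g(x))
    k <- k + 1
    y[k] <- x
  }
}

j / n                           # average iterations per variate, about cc
```

A tighter bound c gives a higher acceptance rate, since on average c candidates are needed per accepted value.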

Compare the empirical and theoretical percentiles. Larger numbers of replicates are required for estimation of percentiles where the density is close to zero. Some examples are 1. Generators based on transformations 5 and 6 are implemented in Examples 3. Sums and mixtures are special types of transformations that are discussed in Section 3. This transformation determines an algorithm for generating random Beta(a, b) variates.

Generate a random u from Gamma(a, 1). Generate a random v from Gamma(b, 1). This method is applied below to generate a random Beta(3, 2) sample. If the sampled distribution is Beta(3, 2), the QQ plot should be nearly linear. The QQ plot of the ordered sample vs the Beta(3, 2) quantiles in Figure 3. Generate u from Unif(0,1). Generate v from Unif(0,1). Below is a comparison of the Logarithmic 0.
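The gamma-ratio transformation for beta variates can be sketched as follows: if U ~ Gamma(a, 1) and V ~ Gamma(b, 1) are independent, then U/(U+V) has the Beta(a, b) distribution. The sample size and variable names are assumptions for illustration.

```r
set.seed(7)
n <- 1000
a <- 3
b <- 2

u <- rgamma(n, shape = a, rate = 1)
v <- rgamma(n, shape = b, rate = 1)
x <- u / (u + v)                # Beta(a, b) by the gamma-ratio transformation

# QQ plot of the ordered sample against Beta(3, 2) quantiles;
# the points should fall near the line y = x
q <- qbeta(ppoints(n), a, b)
qqplot(q, x, xlab = "Beta(3, 2) quantiles", ylab = "Sample quantiles")
abline(0, 1)
```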

The empirical probabilities p. R note 3. For other types of data, recode the data to positive integers or use table. If the data are not positive integers, tabulate will truncate real numbers and ignore without warning integers less than 1.

In this section we focus on sums of independent random variables (convolutions) and several examples of discrete and continuous mixtures. It is straightforward to simulate a convolution by directly generating X1 ,. Several distributions are related by convolution.

The negative binomial distribution NegBin(r, p) is the convolution of r iid Geom(p) random variables. See Bean [23] for an introductory level presentation of these and many other interesting relationships between families of distributions.

In R it is of course easier to use the functions rchisq, rgeom and rnbinom to generate chi-square, geometric and negative binomial random samples. The following example is presented to illustrate a general method that can be applied whenever distributions are related by convolutions. Square each entry in the matrix 1. Compute the row sums of the squared normals. Deliver the vector of row sums. Here the standard errors of the sample moments are 0. The apply function applies a function to the margins of an array.
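The matrix approach described above can be sketched for a chi-square variable, which is the sum of squared independent standard normals. The degrees of freedom nu = 2 and the sample size are assumptions for illustration.

```r
set.seed(8)
n  <- 5000
nu <- 2                         # assumed degrees of freedom

# n x nu matrix of standard normals, each entry squared
X <- matrix(rnorm(n * nu), nrow = n, ncol = nu)^2

# Each row sum is one chi-square(nu) variate; no explicit loop is needed
y <- rowSums(X)                 # equivalent to apply(X, MARGIN = 1, FUN = sum)

mean(y)                         # theoretical mean is nu
var(y)                          # theoretical variance is 2 * nu
```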

Notice that a loop is not used to compute the row sums. For row and column sums it is easier to use rowSums and colSums. Compare the methods for simulation of a convolution and a mixture of normal variables. To simulate the convolution: 1. Unlike the convolution above, the distribution of the mixture X is distinctly non-normal; it is bimodal.

To simulate the mixture: 1. In the following example we will compare simulated distributions of a convolution and a mixture of gamma random variables. A list of all graphical parameters is returned by par. The method of generating the mixture in this example is simple for a mixture of two distributions, but not for arbitrary mixtures.

The next example illustrates how to generate a mixture of several distributions with arbitrary mixing probabilities. To simulate one random variate from the mixture F_X: 1. To generate a sample size n, steps 1 and 2 are repeated n times. The algorithm can be translated into a vectorized approach. Generate a random sample k1 ,. Generate a gamma sample size n, with shape parameter r and rate vector rate (use rgamma). The density curves in Figure 3. This example is similar to the previous one.
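The vectorized approach to a gamma mixture can be sketched as below. The five mixing probabilities, the rates, and the common shape parameter are hypothetical values for illustration.

```r
set.seed(9)
n      <- 5000
p      <- c(0.1, 0.2, 0.2, 0.3, 0.2)   # hypothetical mixing probabilities
lambda <- c(1, 1.5, 2, 2.5, 3)         # hypothetical rate for each component
r      <- 3                            # hypothetical common shape parameter

# Step 1: choose a component index for every observation at once
k <- sample(seq_along(p), size = n, replace = TRUE, prob = p)

# Step 2: rgamma accepts a vector of rates, one per observation
rate <- lambda[k]
x <- rgamma(n, shape = r, rate = rate)

# Informal check against the mixture mean sum(p_j * r / lambda_j)
theo_mean <- sum(p * r / lambda)
c(sample = mean(x), theoretical = theo_mean)
```

Indexing lambda by k replaces the loop over observations that the one-variate-at-a-time description suggests.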

This example is a programming exercise that involves vectors of parameters and repeated use of the apply function. To produce the plot, we need a function to compute the density f(x) of the mixture. If x has length 1, dgamma(x, 3, lambda) is a vector the same length as lambda; in this case f1 x ,. The sum of this vector is the density of the mixture 3.

The code to produce the plot is listed below. The densities f_k can be computed by the dgamma function. Since x is a vector, it does not have a dimension attribute by default. This example illustrates a method of sampling from a Poisson-Gamma mixture and compares the sample with the negative binomial distribution.
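A sketch of the Poisson-Gamma mixture follows. If the Poisson mean is itself Gamma(r, beta) distributed, the resulting counts follow a negative binomial distribution with size r and success probability beta/(1 + beta). The parameter values are assumptions for illustration.

```r
set.seed(10)
n    <- 10000
r    <- 4                       # assumed gamma shape
beta <- 3                       # assumed gamma rate

lambda <- rgamma(n, shape = r, rate = beta)   # random Poisson means
x <- rpois(n, lambda)                         # one count per lambda

# Compare empirical frequencies with the NegBin(r, beta / (1 + beta)) pmf
vals <- 0:5
emp  <- sapply(vals, function(v) mean(x == v))
theo <- dnbinom(vals, size = r, prob = beta / (1 + beta))
round(rbind(vals, emp, theo), 3)
```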

The corresponding R functions are eigen, chol, and svd. Usually, one does not apply a linear transformation to the random vectors of a sample one at a time. Typically, one applies the transformation to a data matrix and transforms the entire sample. Then the rows of Z are n random observations from the d-dimensional standard MVN distribution. This saves a matrix multiplication. In this section each method of generating MVN random samples is illustrated with examples.

Also note that there are functions provided in R packages for generating multivariate normal samples. See the mvrnorm function in the MASS package [], and rmvnorm in the mvtnorm package []. In all of the examples below, the rnorm function is used to generate standard normal random variates. This method can also be called the eigen-decomposition method.

The scatter plot of the sample data shown in Figure 3. The Choleski factorization is implemented in the R function chol. The pairs plot of the data in Figure 3. The joint distribution of each pair of marginal distributions is theoretically bivariate normal.
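The Choleski approach can be sketched as a small function that transforms a matrix of standard normals. The function name rmvn_chol, the mean vector, and the covariance matrix are hypothetical choices for this illustration. Note that chol returns an upper triangular factor Q with t(Q) %*% Q equal to Sigma, so the data matrix is multiplied on the right by Q.

```r
# Hypothetical helper: generate n MVN(mu, Sigma) observations in rows
rmvn_chol <- function(n, mu, Sigma) {
  d <- length(mu)
  Q <- chol(Sigma)                              # t(Q) %*% Q == Sigma
  Z <- matrix(rnorm(n * d), nrow = n, ncol = d) # rows are standard MVN
  Z %*% Q + matrix(mu, n, d, byrow = TRUE)      # transform the whole sample
}

set.seed(11)
mu    <- c(0, 1)                                # assumed mean vector
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)        # assumed covariance matrix
X <- rmvn_chol(5000, mu, Sigma)

colMeans(X)                                     # should be close to mu
cov(X)                                          # should be close to Sigma
```

Calling pairs(X) on the result gives a scatterplot matrix for a visual check.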

The iris virginica data are not multivariate normal, but means and correlation for each pair of variables should be similar to the simulated data. The parameters match the mean and covariance of the iris virginica data. Remark 3. The transformed d-dimensional sample then has zero mean vector and covariance Id.

This is not the same as scaling the columns of the data matrix. When several methods are available, which method is preferred? One consideration may be the computational time required (the time complexity).

Another important consideration, if the purpose of the simulation is to estimate one or more parameters, is the variance of the estimator.

The latter topic is considered in Chapter 5. To compare the empirical performance with respect to computing time, we can time each procedure. R provides the system. In the next example, the system. This example uses a function rmvnorm in the package mvtnorm []. The covariances used for this example are actually the sample covariances of standard multivariate normal samples. In order to time each method on the same covariance matrices, the random number seed is restored before each run.

The last run simply generates the covariances, for comparison with the total time. The Choleski method is somewhat faster, while rmvn. The similar performance of rmvn. The code not shown is similar to the examples above.

As the mixing parameter p and other parameters are varied, the multivariate normal mixtures have a wide variety of types of departures from normality.

Parameters can be varied to generate a wide variety of distributional shapes. Johnson [] gives many examples for the bivariate normal mixtures. Many commonly applied statistical procedures do not perform well under this type of departure from normality, so normal mixtures are often chosen to compare the properties of competing robust methods of analysis. If X has the distribution 3.

The following procedure is equivalent. Use the mvrnorm (MASS) function [] to generate the multivariate normal observations. We will eliminate the loop later. Generate a random permutation of the indices 1:n to indicate the order in which the sample observations appear in the data matrix. See Appendix B.

All of the one dimensional marginal distributions are univariate normal location mixtures. Methods for visualization of multivariate data are covered in Chapter 4. Also, an interesting view of a bivariate normal mixture with three components is shown in Figure Implementation is left as an exercise. Random vectors uniformly distributed on the d-sphere have equally likely directions.

A method of generating this distribution uses a property of the multivariate normal distribution see e. The ith row of M corresponds to the ith random vector ui. Compute the denominator of 3. Deliver matrix U containing n random observations in rows.
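The normalization approach for the d-sphere can be sketched as below: each row of a standard normal matrix is divided by its Euclidean length. The function name runif_sphere is a hypothetical choice for this illustration.

```r
# Hypothetical helper: n points uniformly distributed on the unit d-sphere
runif_sphere <- function(n, d) {
  M <- matrix(rnorm(n * d), nrow = n, ncol = d)  # row i is the ith random vector
  L <- sqrt(rowSums(M^2))                        # Euclidean length of each row
  M / L                                          # L recycles down columns: row i / L[i]
}

set.seed(12)
U <- runif_sphere(200, 2)

range(rowSums(U^2))             # every row has squared length 1
plot(U, asp = 1, xlab = "", ylab = "")           # points on the unit circle
```

The division M / L works row by row because R stores matrices by column and L has one entry per row.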

See the help topic? Uniformly distributed points on a hyperellipsoid can be generated by applying a suitable linear transformation to a Uniform sample on the d-sphere. Fishman [94, 3. The index set T could be discrete or continuous.

The set of possible values X(t) can take is the state space, which also can be discrete or continuous. Ross [] is an excellent introduction to stochastic processes, and includes a chapter on simulation.

A counting process records the number of events or arrivals that occur by time t. A counting process has independent increments if the numbers of arrivals in disjoint time intervals are independent. A counting process has stationary increments if the number of events occurring in an interval depends only on the length of the interval. An example of a counting process is a Poisson process. The set of times of consecutive arrivals records the outcome and determines the state X(t) at any time t.

The interarrival times T1 , T2 ,. One method of simulating a Poisson process is to generate the interarrival times. It should be translated into vectorized operations, as shown in the next example. Suppose we need N(3), the number of arrivals in [0, 3]. That is, given that the number of arrivals in (0, t) is n, the arrival times S1 ,. Returning to Example 3. As a check, we estimate the mean and variance of N(3) from replications.

Here the sample mean and sample variance of the generated values N(3) are indeed very close to 6. In this case, the process needs to be simulated for a longer time than the value in upper. For example, if we need N(t0), one approach is to wrap the min which step with try and check that the result of try is an integer using is.
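The interarrival-time approach to N(3) can be sketched as follows. The rate lambda = 2 is inferred from the stated mean of 6 on [0, 3]; the function name sim_N and the number of interarrival times generated (upper = 100) are assumptions for illustration.

```r
# Hypothetical helper: number of arrivals in [0, t0] for a rate-lambda process
sim_N <- function(t0, lambda, upper = 100) {
  Tn <- rexp(upper, rate = lambda)   # interarrival times
  Sn <- cumsum(Tn)                   # consecutive arrival times
  sum(Sn <= t0)                      # arrivals that occur by time t0
}

set.seed(13)
N3 <- replicate(10000, sim_N(t0 = 3, lambda = 2))

mean(N3)                             # theoretical mean is lambda * t0 = 6
var(N3)                              # theoretical variance is also 6
```

With upper = 100 the last generated arrival time is far beyond t0 = 3 in practice; a larger value of upper would be needed for longer intervals or higher rates.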

See the corresponding help topics for details. Actually, the second method is considerably slower (by a factor of 4 or 5) than the previous method of Example 3. The rexp generator is almost as fast as runif, while the sort operation adds O(n log n) time. Some performance improvement might be gained if this algorithm is coded in C and a faster sorting algorithm designed for uniform numbers is used.

A nonhomogeneous Poisson process has independent increments but does not have stationary increments. Every nonhomogeneous Poisson process with a bounded intensity function can be obtained by time sampling a homogeneous Poisson process.

To see this, let N(t) be the number of accepted events in [0, t]. The steps to simulate the process on an interval [0, t0] are as follows. Algorithm for simulating a nonhomogeneous Poisson process on an interval [0, t0] by sampling from a homogeneous Poisson process.
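The sampling (thinning) idea can be sketched as below. The intensity function 3 cos^2(t), its bound lambda0 = 3, and the function name sim_nhpp are assumptions for illustration. The homogeneous arrivals are generated here from the conditional distribution of arrival times given their number (ordered uniforms), and each is accepted with probability lambda(t)/lambda0.

```r
# Hypothetical helper: arrival times of a nonhomogeneous Poisson process on [0, t0]
sim_nhpp <- function(t0, lambda_fun, lambda0) {
  n_hom <- rpois(1, lambda0 * t0)          # homogeneous arrivals in [0, t0]
  s     <- sort(runif(n_hom, 0, t0))       # their times: ordered uniforms
  keep  <- runif(n_hom) <= lambda_fun(s) / lambda0   # accept/reject each time
  s[keep]
}

lambda_fun <- function(t) 3 * cos(t)^2     # assumed intensity, bounded by 3

set.seed(15)
t0 <- 2 * pi
counts <- replicate(2000, length(sim_nhpp(t0, lambda_fun, lambda0 = 3)))

# The mean of N(t0) is the integral of the intensity over [0, t0], here 3 * pi
c(sample = mean(counts), theoretical = 3 * pi)
```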

This is shown in the next example. This example is discussed in [, Sec. The process can be simulated by generating geometric interarrival times and computing the consecutive arrival times by the cumulative sum of interarrival times.

The plot is shown in Figure 3. The process has returned to 0 several times within time [1, ]. If the process has returned to the origin before time n, then to generate Sn we can ignore the past history up until the time the process most recently hit 0.
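A direct simulation of a symmetric random walk, together with the times at which it returns to 0, can be sketched as below. The number of steps and the variable names are assumptions for illustration; this is the straightforward cumulative-sum approach, not the more efficient algorithm that skips the history before the last return.

```r
set.seed(14)
n <- 400                                   # assumed number of steps

incr <- sample(c(-1, 1), size = n, replace = TRUE)   # iid steps of +1 or -1
S <- c(0, cumsum(incr))                    # S[1] is S_0 = 0, S[i + 1] is S_i

returns <- which(S[-1] == 0)               # times 1..n at which the walk is at 0
returns

plot(0:n, S, type = "l", xlab = "n", ylab = "S_n")
abline(h = 0, lty = 2)
```

Every return time is even, consistent with all states of the walk having period 2.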

Then starting from the last return to the origin before time n, generate the increments Xi and sum them. Algorithm to simulate the state Sn of a symmetric random walk The following algorithm is adapted from [69, XIV. Let Wj be the waiting time until the jth return to the origin. Deliver si. The probability distribution of T [69, Thm.

The following methods are equivalent. Therefore, a generator can be written for values of T up to using the probability vector computed above. Suppose now that n is given and we need to compute the time of the last return to 0 in (0, n]. Here instead of issuing a warning, one could append to the vector and return a valid T. We leave that as an exercise. A better algorithm is suggested by Devroye [69, p. One run of the simulation above generates the times , , , , , , and that the process visits 0 (uncomment the print statement to print the times).

Algorithms for generating random tours in general are discussed by Fishman [94, Ch. For a more theoretical treatment see Durrett [77, Ch. See Franklin [98] for simulation of Gaussian processes. Functions to simulate long memory time series processes, including fractional Brownian motion are available in the R package fSeries see e. See Examples 2. Use the inverse transform method to generate a random sample of size from this distribution.

Use one of the methods shown in this chapter to compare the generated sample to the target distribution. Graph the density histogram of the sample with the Pareto(2, 2) density superimposed for comparison. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function. Generate a random sample of size from the Beta(3,2) distribution. Graph the histogram of the sample with the theoretical Beta(3,2) density superimposed.

3. Compare the histogram with the lognormal density curve given by the dlnorm function in R. Write a function to generate random variates from fe , and construct the histogram density estimate of a large simulated random sample. Make a conjecture about the values of p1 that produce bimodal mixtures. Compare the empirical and theoretical Pareto distributions by graphing the density histogram of the sample and superimposing the Pareto density curve.

Use the R pairs plot to graph an array of scatter plots for each pair of variables. That is, transform the sample so that the sample mean vector is zero and sample covariance is the identity matrix. To check your results, generate multivariate normal samples and print the sample mean vector and covariance matrix before and after standardization.

Each row of the data frame is a set of scores xi1 ,. Standardize the scores by type of exam. That is, standardize the bivariate samples (X1, X2) (closed book) and the trivariate samples (X3, X4, X5) (open book). Compute the covariance matrix of the transformed sample of test scores. See Example 3. The game ends when either one of the players has all the money. Let Sn be the fortune of player A at time n. Estimate the mean and the variance of X(10) for several choices of the parameters and compare with the theoretical values.

Chapter 4 Visualization of Multivariate Data

4. Tukey [] believed that it was important to do the exploratory work before hypothesis testing, to learn what are the appropriate questions to ask, and the most appropriate methods to answer them. Here we restrict attention to methods for visualizing multivariate data. In this chapter several graphics functions are used. In addition to the R graphics package, which loads when R is started, other packages discussed in this chapter are lattice [] and MASS (see []).

Also see the rggobi [] interface to GGobi and rgl [2] package for interactive 3D visualization. Table 4. Chapter 1 gives a brief summary of options for colors, plotting symbols, and line types. For example, a scatterplot matrix displays the scatterplots for all pairs of variables in an array. The pairs function in the graphics package produces a scatterplot matrix, as shown in Figures 4.

An example of a panel display of three-dimensional plots is Figure 4. The pairs function takes an optional argument diag. For example, to obtain a graph with estimated density curves along the diagonal, supply the name of a function to plot the densities. The function below called panel. Before plotting, we apply the scale function to standardize each of the one-dimensional samples.
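A scatterplot matrix with estimated density curves along the diagonal can be sketched as below. The panel function name panel.d, the vertical axis limit of 0.5, and the choice of the iris virginica measurements (rows 101 to 150 of the iris data) are assumptions for this illustration.

```r
# Hypothetical diagonal panel function: draw a density estimate of one variable
panel.d <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))                 # restore the coordinate system on exit
  par(usr = c(usr[1:2], 0, 0.5))          # keep the x range, set y range for densities
  lines(density(x))
}

# Standardize each one-dimensional sample before plotting
x <- scale(iris[101:150, 1:4])

pairs(x, diag.panel = panel.d)
```

Because the variables are standardized, a single vertical range works for all four diagonal panels.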

From the plot we can observe that the length variables are positively correlated, and the width variables appear to be positively correlated. Other structure could be present in the data that is not revealed by the bivariate marginal distributions. The lattice package [] provides functions to construct panel displays. Here we illustrate the scatterplot matrix function splom in lattice. It is displayed here in black and white, but on screen the panel display is easier to interpret when displayed in color plot 2.

Also see the 3D scatterplot of the iris data in Figure 4. For other types of panel displays, see the conditioning plots [42, 48, 49] implemented in coplot. Width 1. The persp graphics function draws perspective plots of surfaces over the plane.

Try running the demo examples for persp, to see many interesting graphs. The command is simply demo(persp). We will also look at 3D methods in the lattice graphics package and the rgl package [, , 2]. The command for this is expand. Example 4. Most of the parameters are optional; x, y, z are required. For this function we need the complete grid of z values, but only one vector of x and one vector of y values.

The returned value is a matrix of function values for every point xi , yj in the grid. Storing the grid was not necessary. This transformation can be used to add elements to the plot. Example 4. Here we have shown the calculations. Other functions for graphing surfaces Surfaces can also be graphed using the wireframe lattice function []. The syntax for wireframe requires that x, y and z have the same number of rows. We can generate the matrix of x, y coordinates using expand.
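A perspective plot of a surface can be sketched with outer and persp as below. The standard bivariate normal density, the grid, and the viewing angles are assumptions for illustration. outer evaluates the function at every grid point without storing the grid itself.

```r
# Assumed surface: the standard bivariate normal density
f <- function(x, y) exp(-(x^2 + y^2) / 2) / (2 * pi)

x <- y <- seq(-3, 3, length.out = 20)     # one vector per axis
z <- outer(x, y, f)                       # matrix of f(x_i, y_j) values

persp(x, y, z, theta = 45, phi = 30, expand = 0.6,
      ticktype = "detailed", xlab = "x", ylab = "y", zlab = "f(x, y)")
```

The value returned by persp is a viewing transformation matrix, which can be passed to trans3d to add points or lines to the plot.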

If the rgl package is installed, run the demo. One of the examples in the demo shows a bivariate normal density. Actually, the data used to plot the surface in this demo is generated by smoothing simulated bivariate normal data. Chapter 10 gives examples of methods to construct and plot density estimates for bivariate data. Figures A possible application of this type of plot is to explore whether there are groups or clusters in the data.

The second part of the example illustrates several options. There are three species of iris and each is measured on four variables. The plot produced is similar to 3 in Figure 4. To see all four plots on the screen, use the more and split options. The split arguments determine the location of the plot within the panel display. The plots show that the three species of iris are separated into groups or clusters in the three dimensional subspaces spanned by any three of the four variables.

There is some structure evident in these plots. One might follow up with cluster analysis or principal components analysis to analyze the apparent structure in the data. Syntax for print(cloud): To split the screen into n rows and m columns, and put the plot into row r and column c, set split equal to the vector (c, r, m, n); the lattice print method expects the column position and the number of columns first.

See print. The functions contour (graphics) and contourplot (lattice) [] produce contour plots. The functions filled. A variation of this type of plot is image (graphics), which uses color to identify contour levels. The data is an 87 by 61 matrix containing topographic information for the Maunga Whau volcano. It may also be interesting to see the 3D surface of the volcano for comparison with the contour plots.

A 3D view of the volcano surface is provided in the examples of the persp function. The R code for the example is in the persp help page. To run the example, type example(persp). If the rgl package is installed, an interactive 3D view of the volcano appears in the examples. The image function in the graphics package provides the color background for the plot.

The plot produced below is similar to Figure 4. Using image without contour produces essentially the same type of plot as filled. The contours of filled. Compare the plot produced by image with the following two plots. The display on the screen will be in color. In this case, the 2D scatterplot does not reveal much information about the bivariate density.
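A short sketch of the three kinds of display discussed here, using the volcano matrix that ships with R. The palette, the contour levels, and the aspect ratio are illustrative choices, not taken from the text.

```r
# volcano (datasets package) is an 87 x 61 matrix of elevations.
z <- volcano

# Contour lines only.
contour(z, asp = 1, labcex = 1)

# Color background from image(), with contour lines overlaid.
image(z, col = terrain.colors(100), axes = FALSE)
contour(z, levels = seq(100, 200, by = 10), add = TRUE)
box()

# filled.contour() draws its own color key, so it is called on its own.
filled.contour(z, color.palette = terrain.colors, asp = 1)
```

Because filled.contour sets up its own layout for the key, overlaying further graphics on it is less convenient than with image.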

The hexbin function in package hexbin [38] (available from the Bioconductor repository) produces a basic version of this plot in grayscale, shown in Figure 4. Note that the darker colors correspond to the regions where the density is highest, and colors are increasingly lighter along radial lines extending from the mode near the origin.

The plot exhibits approximately circular symmetry, consistent with the standard bivariate normal density. The bivariate histogram can also be displayed in 2D using a color palette, such as heat.colors.
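As a rough base R illustration of a flat bivariate histogram with a color palette, the sketch below bins simulated standard bivariate normal data on a rectangular grid. Rectangular bins are swapped in here for the hexagonal bins of hexbin, and the sample size, seed, and bin width are arbitrary assumptions.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rnorm(n)

# Bin both coordinates on a common grid and count the points in each cell.
breaks <- seq(-4.5, 4.5, by = 0.5)
counts <- table(cut(x, breaks), cut(y, breaks))
mids <- breaks[-1] - 0.25            # bin midpoints

# Reverse the palette so that high counts map to the red end.
image(mids, mids, unclass(counts), col = rev(heat.colors(12)),
      xlab = "x", ylab = "y")
```

Cells near the origin should show the deepest color, fading roughly symmetrically in all directions.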

A similar type of plot is implemented in the gplots package []. The plot (not shown) resulting from the following code is similar to Figure 4. These include, among others, Andrews curves, parallel coordinate plots, and various iconographic displays such as segment plots and star plots. Queensland, Australia for two types of leaf architecture [] are represented by Andrews curves. The data set is leafshape17 in the DAAG package [, ]. Three measurements (leaf length, petiole, and leaf width) correspond to points in R3.

In general, this type of plot may reveal possible clustering of data. By default, the statistic is subtracted but other operations are possible. Then the ranges of each of the three columns in r are swept out; that is, each column is divided by its range. R note 4. The representation of vectors by parallel coordinates was introduced by Inselberg [] and applied for data analysis by Wegman []. Rather than represent axes as orthogonal, the parallel coordinate system represents axes as equidistant parallel lines.
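The sweep-based scaling and the Andrews curves discussed earlier can be sketched as below. Three iris measurements stand in for the leafshape17 data so that the example needs no add-on package; that substitution, the helper name andrews, and the grid of 101 points are assumptions. The curve used for a point (x1, x2, x3) is the standard Andrews form x1/sqrt(2) + x2 sin(t) + x3 cos(t).

```r
# Three iris measurements stand in for points in R^3 (an assumption;
# the text uses leafshape17 from the DAAG package).
x <- as.matrix(iris[, 1:3])

# Center each column, then divide each column by its range using sweep().
x <- sweep(x, 2, colMeans(x))                  # default FUN is "-"
r <- apply(x, 2, function(v) diff(range(v)))   # column ranges
x <- sweep(x, 2, r, FUN = "/")

# Andrews curve of a point v = (v1, v2, v3), evaluated at angles theta.
andrews <- function(v, theta) {
  v[1] / sqrt(2) + v[2] * sin(theta) + v[3] * cos(theta)
}

theta <- seq(-pi, pi, length.out = 101)
curves <- apply(x, 1, andrews, theta = theta)  # one column per observation

# Line type identifies the group, as in a grayscale display.
matplot(theta, curves, type = "l", lty = as.integer(iris$Species), col = 1,
        xlab = "t", ylab = "Andrews curve")
legend("topleft", legend = levels(iris$Species), lty = 1:3, bty = "n")
```

Curves that bunch together by line type suggest clustering in the original three-dimensional data.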

Usually these lines are horizontal with common origin, scale, and orientation. Then to represent vectors in Rd , the parallel coordinates are simply the coordinates along the d copies of the real line.

Each coordinate of a vector is then plotted along its corresponding axis, and the points are joined together with line segments. Parallel coordinate plots are implemented by the parcoord function in the MASS package [] and the parallel function in the lattice package []. The parcoord function displays the axes as vertical lines. The panel function parallel displays the axes as horizontal lines. The crabs data frame has 5 measurements on each of crabs, from four groups of size The graph is best viewed in color.
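A minimal parallel coordinate plot of the crabs measurements using parcoord might look like this. Taking every fifth observation and coloring by the species and sex combination are illustrative choices, not necessarily those of the text. In recent lattice versions the corresponding function is named parallelplot.

```r
library(MASS)     # provides parcoord() and the crabs data

# Every fifth crab keeps the plot readable (an assumption for this sketch).
x <- crabs[seq(5, nrow(crabs), by = 5), ]

# Columns 4 to 8 hold the five measurements FL, RW, CL, CW, BD.
grp <- interaction(x$sp, x$sex)
parcoord(x[, 4:8], col = as.integer(grp), var.label = TRUE)
```

Each crab is one polyline across the five vertical axes; parcoord rescales every variable to a common range before plotting.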

Much of the variability between groups is in overall size. Adjusting the measurements of individual crabs for size may produce more interesting plots.

Following the suggestion in Venables and Ripley [] we adjust the measurements by the area of the carapace. The Andrews curves in Example 4. Andrews curves were displayed superimposed on the same coordinate system. Other representations as icons are best displayed in a table, so that features of observations can be compared. A tabular display does not have much practical value for high dimension or large data sets, but can be useful for some small data sets.

Some examples include star plots and segment plots. This type of plot is easily obtained in R using the stars graphics function. As in Example 4. The observations are labeled by species. The plot suggests, for example, that orange crabs have greater body depth relative to carapace width than blue crabs. The measurements have been adjusted by overall size of the individual crab. The two species are blue B and orange O. Principal components analysis similarly uses projections see e.
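A segment plot of size-adjusted crab measurements could be produced along the following lines. Treating carapace width times carapace length as the area, and dividing each measurement by its square root, is an assumption about how the adjustment is carried out; the subsample and layout are also illustrative.

```r
library(MASS)

x <- crabs[seq(5, nrow(crabs), by = 5), ]

# Rough carapace area: width times length (assumed form of the adjustment).
a <- x$CW * x$CL
x[, 4:8] <- x[, 4:8] / sqrt(a)     # divide each measurement by a length scale

# One segment icon per crab, labeled by species (B or O).
stars(x[, 4:8], draw.segments = TRUE, labels = as.character(x$sp),
      nrow = 4, col.segments = gray.colors(5))
```

By default stars rescales each column to the unit interval, so segment sizes are comparable across variables but not in the original units.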

Dimension is reduced by projecting onto a small number of the principal components that collectively explain most of the variation. Pattern recognition and data mining are two broad areas of research that use some visualization methods.
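The idea of projecting onto a few principal components can be illustrated with prcomp. The iris data and the choice to standardize the variables are assumptions made for this sketch.

```r
# Principal components of the four iris measurements, standardized first.
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(cumsum(var_explained), 3))

# Project onto the first two components and plot, marking species by symbol.
plot(pc$x[, 1:2], pch = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
```

When the cumulative proportion for the first two components is close to 1, a 2D scatterplot of the scores loses little of the variation in the original four variables.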

See Ripley [] or Duda and Hart [75]. An interesting collection of topics on data mining and data visualization is found in Rao, Wegman, and Solka []. In addition to the R functions and packages mentioned in this chapter, several methods are available in other packages.

Again, here we only name a few. Mosaic plots for visualization of categorical data are available in mosaicplot. Also see the package vcd [] for visualization of categorical data. The functions prcomp and princomp provide principal components analysis. Many packages for R fall under the data mining or machine learning umbrella; for a start see nnet [], rpart [], and randomForest []. The rggobi [] package provides a command-line interface to GGobi, which is an open source visualization program for exploring high-dimensional data.

GGobi has a graphical user interface, providing dynamic and interactive graphics. Exercises 4. Visualization of Multivariate Data 4. Generate a bivariate random sample from the joint distribution of X, Y and construct a contour plot. Adjust the levels of the contours so that the contours of the second mode are visible. Compare the plots before and after adjusting the measurements by the size of the crab.

Interpret the resulting plots. Set line type to identify leaf architecture as in Example 4. Compare with the plot in Figure 4. Produce Andrews curves for each of the six locations. Split the screen into six plotting areas, and display all six plots on one screen. Set line type or color to identify leaf architecture.

Display a segment style stars plot for leaf measurements at latitude 42 (Tasmania). Repeat using the logarithms of the measurements.

Another well-known example is that W. Teams of scientists at the Los Alamos National Laboratory and many other researchers contributed to the early development, including Ulam, Richtmyer, and von Neumann [, ].

Specific acceptable prerequisites are listed below. Students will be expected to have reviewed the class notes prior to each class and will be expected to bring their laptops, loaded with R and the RStudio IDE, to class.

Classes will have a four-part structure:
1. A topic overview
2. Instructor demonstration
3. Student hands-on group-based project activity
4. Wrap-up and project review

By the end of the course students should be able to code statistical functions in R. They should be able to extend the functionality of R by using add-on packages and they should be able to use R to perform the workhorse statistical tasks such as multiple regression and simulation analyses.

These classes all take students to the level of multiple regression. The R statistical software program. There is no required textbook, though there are many optional texts available for students to refer to. These homeworks will be prescriptive and involve performing a set of programming related tasks in R.

The deliverables will be R code, output and related discussions. There is no final exam but rather a take home final project. Homework should be submitted to Canvas as text files for code and PDF files for output.

Specifically, there needs to be a problem definition and scoping stage. Data needs to be identified and read into the analysis platform.

The analysis occurs. Results are reported to interested parties.

Using the help facility.
Module 2 Data structures: vectors, matrices, lists and data frames.
Module 3 Reading data into R from various data sources. Merging data across data sources.
Module 4 Statistical modeling functions: lm and glm.
Module 5 Writing your own functions I.
Module 6 Writing your own functions II.
Module 7 Iterating with R: logic and flow control.
Module 8 Simulation I.
Module 9 Simulation II.
Module 10 Extending R with add-on packages and the R ecosystem.
Module 11 Graphics.
Module 12 Dynamic and web reporting: Knitr and Shiny.

Running R as part of a business pipeline—the R terminal. This involves first of all installing R and RStudio. The basic functionality of R will be demonstrated. Using R for calculations. Using R to calculate summary statistics on data. Using R to generate random numbers. Variable types in R. Numeric variables, strings and factors. Accessing the help system. Data structures: vectors, matrices, lists and data frames R makes extensive use of various data structures. The core data structures are vectors, matrices, arrays, lists and dataframes.

We will discuss accessing elements of these data structures, sub-setting vectors, slicing arrays and drilling down on lists. We will also take a look at the apply and lapply functions, that allow you to apply functions to arrays and lists. Reading data into R from various data sources. Merging data across data sources R has many options for bringing in data for analysis. These include reading from flat files, reading from database connections and reading from web sources.
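As a small taste of apply and lapply and of drilling down on lists, consider the sketch below. The objects m and lst are made up for illustration.

```r
# apply() works over the rows (1) or columns (2) of a matrix or array.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)      # one value per row
col_means <- apply(m, 2, mean)    # one value per column

# lapply() applies a function to each element of a list and returns a list.
lst <- list(a = 1:3, b = c(2.5, 4), c = letters[1:5])
lens <- lapply(lst, length)

# Drilling down: $ or [[ ]] extracts an element, [ ] then subsets it.
second_b <- lst$b[2]
first_c <- lst[["c"]][1]
```

Here row_sums is c(9, 12) and col_means is c(1.5, 3.5, 5.5), since m has rows (1, 3, 5) and (2, 4, 6).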

Many problems involve multiple data sources, so we will discuss merging data sources in R using the join command. Statistical modeling functions: lm and glm Linear and generalized linear models for example, logistic regression are the workhorses of modern analytics.
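The merging step mentioned above can be sketched with base R's merge function, which is used here as an assumption; the course may instead intend the join functions of an add-on package. The two small data frames are invented for illustration.

```r
# Two hypothetical data sources sharing the key column id.
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders <- data.frame(id = c(1, 1, 3, 4), amount = c(10, 25, 5, 40))

# Inner join: only ids present in both sources.
inner <- merge(customers, orders, by = "id")

# Left join: keep every customer, filling missing order amounts with NA.
left <- merge(customers, orders, by = "id", all.x = TRUE)
```

The inner join has three rows (two orders for id 1, one for id 3); the left join adds a fourth row for the customer with no orders.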

This class will illustrate the implementation of these functions in R and requires the use of the formula syntax for model specification. We will discuss prediction and model checking via residuals.
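A brief sketch of the formula syntax with lm and glm, including prediction and a residual check. The mtcars data and the particular models are illustrative assumptions, not course material.

```r
# Linear model: response ~ predictors, with variables looked up in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))

# Model checking via residuals: look for patterns against fitted values.
res <- residuals(fit)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Prediction for a hypothetical new car.
new_car <- data.frame(wt = 3, hp = 110)
pred <- predict(fit, newdata = new_car)

# Logistic regression: a generalized linear model with a binomial family.
logit <- glm(am ~ wt, data = mtcars, family = binomial)
p <- predict(logit, newdata = data.frame(wt = 3), type = "response")
```

With type = "response", predict returns a fitted probability rather than a value on the log-odds scale.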


