**An introduction to data analysis using R**

George F. Hart, Ph. D.,

Professor emeritus, LSU.

**What does this eBook cover**

**This eBook concentrates on
understanding the basic statistical procedures used to describe data
and to analyze simple problems using the NORMAL distribution. It is
for those who have little or no knowledge of statistics and is
intended to provide a working knowledge in a short time. The eBook
avoids mathematical proofs and uses R to perform the necessary
procedures used in data-analysis, concentrating on understanding what
the statistics are doing, why they are doing it, and what the results
indicate about the data.**

**Table of Contents**

This part is a basic course in statistics and graphics that avoids mathematical proofs and formulas. It uses R code and R procedures to outline the fundamental knowledge needed to use statistics.

What is not covered

Getting R to work on your computer

Libraries

Data-frames for this course

Importing a data-frame into R

Data within R

Importing from a vector or list

Importing from a matrix

Importing data from a data-base

Exporting a data-frame out of R

As text output

As a spreadsheet

As a table

As a graphic

Saving objects in R

Saving a function in R

Saving your commands

Basic descriptive statistics

Median

Mode

Mean

Variance

Standard deviation

Coefficient of variation

Maximum – Minimum – Range

The z-score

Mid-range

Percentile

Quartile

Skewness

Kurtosis

Summary numbers used in descriptive statistics

Summary information using Contingency Tables

Error

Type 1 error

Type 2 error

The nature of data

Percentages, counts and other data measures

Scales of measurement

Nominal data

Dispersion

Association

Ordinal data

Distribution Overview

Central Tendency

Dispersion

Position

Association

Interval data

Distribution Overview

Central Tendency

Dispersion

Position

Ratio data

Distribution Overview

Central tendency

Dispersion

Position

Attribute types

Populations

Spatial variability and sampling a population

Random Sampling

Stratified Sampling

Cluster Sampling

Systematic Sampling

The single sample problem in geological / palaeo-biological studies

The single line problem in geological / palaeo-biological studies

The dual line problem in geological / palaeo-biological studies

The multiple line problem in geological / palaeo-biological studies

Coloring Graphics

Histogram plots

Box plot

Bag plot

Scatter plot

Index plot

Q-Q plot

Residuals plot

Frequency polygon

Pie chart

Pairs plot

Coplot

3-D plot

Transformations

The Box-Cox Normality Plot

Using the regression equation for prediction

Fitting a confidence interval

The variety of distributions

The box-plot eyeball method

Chebyshev's Theorem

Shapiro-Wilk test for normality

The D'Agostino test

The Kolmogorov-Smirnov Goodness-of-fit test

Comparing the sample data-frame with a random distribution

Fitting other distributions

Outliers

Assumptions

Degrees of freedom [df]

Equal population variances

Z-test [used for interval and ratio data analysis]

T-test [interval, ratio data-analysis]

Independent [unpaired] t-test

Dependent [paired] t-test

Calculating a single probability value

Confidence Interval [CI]

Setting a confidence interval around a t distributed variable

Setting a confidence interval around a normally distributed variable

Calculating the power of the test, assuming standard deviation is known

Correlation coefficients

Equitability

Comparative diversity indices

Similarity

What this part covers. This section is designed for those who analyze samples taken from the natural environment [random samples] rather than samples from designed experiments; that is, it emphasizes sampling-designed analysis, not experimentally designed problems.

ANOVA and regression are complementary methods for data analysis, in that regression creates a model of reality and ANOVA evaluates the model. Both examine how a dependent [response] variable varies in response to factors [predictor or independent variables].

Linear Regression

Simple regression

Multiple regression

The Box-Cox procedure

This part is designed for those who analyze data from designed experiments in which a controlled set of samples is used, and for those who apply experimental design in the collection of sample data-frames.

This part is designed for those who analyze data that need similarity
or difference measurement, including clustering techniques.

The commonly used statistical procedures for classification are of two kinds.
When the groups [classes] are known and the problem is to classify unknowns into
one or another of the groups, the procedure of choice is discriminant function
analysis. Characterization of the groups uses the MANOVA procedure. When the
groups are unknown but a cloud of data points exists that needs to be separated
into groups, the procedure is factor analysis followed by discriminant function
analysis.

This part is designed for those who analyze spatial data that need to be mapped in a geographical coordinate system.

This part is designed for those who already understand how to produce basic graphics in R and want to understand how to use ggplot2 to produce specialized and individualized graphs.