How To View All Variables In R
Organising Data in R
A tutorial about data analysis using R
Dr Jon Yearsley (School of Biology and Environmental Science, UCD)
- Objectives
- Introduction
- Viewing a data frame
- Adding a variable
- Changing a variable's data type
- Removing a variable
- Set missing data to NA
- Subset of a data frame
- Saving data
- Summary of the topics covered
- Further Reading
How to Read this Tutorial
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
# This is a chunk of R code. All text after a # symbol is a comment

# Set working directory using setwd() function
setwd('Enter the path to my working directory')

# Clear all variables in R's memory
rm(list= ls()) # Standard code to clear R's memory
Sometimes the output from running this R code will be displayed after the chunk of code. R output will be preceded by ##.
Here is a chunk of code followed by the R output:
2 + 4 # Use R to add two numbers
## [1] 6
Objectives
The objectives of this tutorial are:
- Introduce the concept of a data frame
- Demonstrate how data frames can be manipulated
- Demonstrate how to reformat data and code for missing data
- Explain data subsetting in R
- Save imported data to a compact binary file
Introduction
This tutorial will show you how to view, subset and manipulate data frames within R. This assumes that the data have been successfully imported into R (if you are unsuccessful at importing data into R then you need to read the data importing worksheet).
The data we'll be using have been imported from these files:
- WOLF.CSV: This file is a text file of comma separated variables.
- INSECT.TXT: This file is a text file of TAB delimited variables.
These data sets are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html
Viewing a data frame
Finding variable names
Use the ls() function to print a list of variables in R's memory
ls() # Display the variables in R's memory
## [1] "files" "files.sheet2" "insect" "wolf"
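As a minimal sketch of how ls() behaves, the chunk below works in a separate environment so that it does not depend on what is already in R's memory. The names demo.env, wolf.demo and n.obs are hypothetical and are not part of the tutorial's data sets.

```r
# Create an empty environment to act as a small, separate 'memory' (hypothetical example)
demo.env = new.env()

# Put two variables into the environment
assign('wolf.demo', data.frame(Cpgmg = c(15.86, 20.02)), envir = demo.env)
assign('n.obs', 2, envir = demo.env)

var.names = ls(demo.env)                      # Names of all variables, sorted alphabetically
wolf.names = ls(demo.env, pattern = '^wolf')  # Only the names that start with 'wolf'

print(var.names)
print(wolf.names)
```

The pattern argument is handy when R's memory holds many variables and only those with a particular prefix are of interest.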
A poor way to view data
Typing the name of a variable will display all the data contained in the variable.
insect # Display the entire insect data frame
##    Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1
## 1       10      11       0       3       3      11 NA  NA
## 2        7      17       1       5       5       9 NA  NA
## 3       20      21       7      12       3      15 NA  NA
## 4       14      11       2       6       5      22 NA  NA
## 5       14      16       3       4       3      15 NA  NA
## 6       12      14       1       3       6      16 NA  NA
## 7       10      17       2       5       1      13 NA  NA
## 8       23      17       1       5       1      10 NA  NA
## 9       17      19       3       5       3      26 NA  NA
## 10      20      21       0       5       2      26 NA  NA
## 11      14       7       1       2       6      24 NA  NA
## 12      13      13       4       4       4      13 NA  NA
BEWARE: Printing out the entire data set is rarely useful, because data sets are often too large to fit on a computer screen (for example, the wolf data frame has 178 rows of data, making it hard to read in one go). There are often better ways to view a data frame than to just print out the entire variable.
Good ways to view data
Here are some options for viewing data frames:
head(wolf)          # Display the first 6 lines of the wolf data frame
tail(wolf, n= 10)   # Display the last 10 lines of the wolf data frame
summary(wolf)       # Display an overview of the wolf data frame
str(wolf)           # Display the structure of the wolf data frame
The summary() function is particularly useful. It displays summary statistics for each variable in a data frame. Later we will see how the summary() function has many uses, such as displaying summary results from a data analysis.
The summary output for a data frame depends upon a variable's data type.
- For quantitative data (num and int) the summary shows the minimum, first quartile (25% quantile), the median (50% quantile or second quartile), the mean, the third quartile (75% quantile), the maximum and, when any are present, the number of missing values (missing values are represented as NA in R). Examples of numerical data in the wolf data frame are Cpgmg, Tpgmg and Ppgmg.
- For qualitative data (factor, logi) the summary shows the number of data points in each category. If a variable has many categories, only the most frequent ones are listed and the remaining categories are lumped together as (Other). The number of missing values is also shown. Examples of qualitative data in the wolf data frame are Sex and Colour.
- For plain text data that isn't qualitative the summary displays the type of data (Class :character).
The data type of a variable (e.g. quantitative, qualitative, character) is displayed in the output from the str() function.
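The chunk below is a small sketch of how the viewing functions respond to different data types. It uses a made-up data frame called wolf.demo (a hypothetical stand-in with illustrative values, not the tutorial's wolf data) so that it can be run without importing any files.

```r
# A small made-up data frame (values are illustrative only)
wolf.demo = data.frame(Sex = c('M', 'F', 'F', 'M'),
                       Population = c(2L, 1L, 2L, 1L),
                       Cpgmg = c(15.86, 20.02, 9.95, NA),
                       stringsAsFactors = FALSE)

head(wolf.demo, n = 2)   # Display the first 2 rows
str(wolf.demo)           # Display the data type of each variable (chr, int, num)
summary(wolf.demo)       # The summary depends upon each variable's data type

# Store the data type of each variable in a named character vector
col.classes = sapply(wolf.demo, class)
print(col.classes)
```

Because Cpgmg contains one NA, its summary includes a count of missing values, whereas the summary of Population does not.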
Viewing part of a data frame
Using $ and a variable's name
A single variable (column) in a data frame can be specified by giving the name of the data frame, followed by a $ followed by the name of the variable.
Here is an example that specifies just the cortisol data in the wolf data frame
wolf$Cpgmg # Display just the cortisol data
The names of the variables can be seen at the top of each column of data (for example, using the head() function)
# Variable names appear above each column of data
head(wolf) # Display first 6 rows of data.
##   Individual Sex Population Colour Cpgmg Tpgmg    Ppgmg
## 1          1   M          2      W 15.86  5.32       NA
## 2          2   F          1      D 20.02  3.71 14.37622
## 3          3   F          2      W  9.95  5.30 21.65902
## 4          4   F          1      D 25.22  3.71 13.42507
## 5          5   M          2      D 21.13  5.34       NA
## 6          6   M          2      W 12.48  4.60       NA
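As a brief sketch, the variable names can also be listed directly with names(), and a single column can be extracted either with $ or with double square brackets. The data frame wolf.demo below is a hypothetical stand-in with illustrative values.

```r
# A small made-up data frame (values are illustrative only)
wolf.demo = data.frame(Sex = c('M', 'F', 'F'),
                       Cpgmg = c(15.86, 20.02, 9.95),
                       stringsAsFactors = FALSE)

var.names = names(wolf.demo)          # Names of all variables in the data frame
cortisol = wolf.demo$Cpgmg            # Extract one variable using $
cortisol.alt = wolf.demo[['Cpgmg']]   # The same variable using [[ ]] and a name in quotes

mean.cortisol = mean(cortisol)        # An extracted variable can be passed to other functions
print(var.names)
print(mean.cortisol)
```

The double square bracket form is useful when the variable name is itself stored in another variable.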
Adding a variable
We can add a variable to a data frame using the $ operator.
Here is an example where we add the variable Replicate (1-12) which codes for each replicate of an experimental treatment
insect$Replicate = c(1 : 12) # Add a variable called Replicate to the data frame
head(insect) # Display the first 6 rows of the updated data frame
##   Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1 Replicate
## 1      10      11       0       3       3      11 NA  NA         1
## 2       7      17       1       5       5       9 NA  NA         2
## 3      20      21       7      12       3      15 NA  NA         3
## 4      14      11       2       6       5      22 NA  NA         4
## 5      14      16       3       4       3      15 NA  NA         5
## 6      12      14       1       3       6      16 NA  NA         6
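A possible variation, sketched below, avoids typing the number of rows by hand and also shows a new variable calculated from existing ones. The data frame insect.demo and the variable Total are hypothetical and used only for illustration.

```r
# A small made-up data frame (values are illustrative only)
insect.demo = data.frame(Spray.A = c(10, 7, 20),
                         Spray.B = c(11, 17, 21))

# Number the replicates from 1 to the number of rows, whatever that number is
insect.demo$Replicate = seq_len(nrow(insect.demo))

# Add a variable calculated from two existing variables
insect.demo$Total = insect.demo$Spray.A + insect.demo$Spray.B

print(insect.demo)
```

Using seq_len(nrow(...)) means the code still works if rows are later added to or removed from the data.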
Changing a variable's data type
Data in statistical analyses are often one of two basic data types: quantitative or qualitative data.
- R calls a continuous quantitative variable numeric (abbreviated to num)
- R calls a discrete quantitative variable integer (abbreviated to int)
- R calls a qualitative variable a factor
A qualitative variable is a set of labels (e.g. large, medium and small). Each label is called a level of the factor.
R also has other data types. Some examples are:
- character data type = plain text (abbreviated to chr)
- logical data type = a variable that is TRUE or FALSE (abbreviated to logi)
In the wolf data frame the variables Population, Individual, Sex and Colour are qualitative (the labels from each of these variables identify a data point to a population, an individual, a sex and a coat colour, respectively).
The data types that R has assigned to each variable can be seen by looking at the structure of the wolf data frame
str(wolf) # Display the structure of the data frame
## 'data.frame':    178 obs. of  7 variables:
##  $ Individual: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex       : chr  "M" "F" "F" "F" ...
##  $ Population: int  2 1 2 1 2 2 1 1 1 2 ...
##  $ Colour    : chr  "W" "D" "W" "D" ...
##  $ Cpgmg     : num  15.86 20.02 9.95 25.22 21.13 ...
##  $ Tpgmg     : num  5.32 3.71 5.3 3.71 5.34 4.6 4.58 9.27 4.81 5.07 ...
##  $ Ppgmg     : num  NA 14.4 21.7 13.4 NA ...
You can see some issues here:
- The variables Population and Individual have not been recognised as qualitative variables (R has identified them as numerical integers, int, because the wolf.csv file used whole numbers as labels for these two variables).
- The variables Sex and Colour have been identified as containing text (chr type), but we want these to be recognised as qualitative nominal data types (R calls this data type a factor). The variable Sex has three levels: 'M', 'F' and 'U' (undetermined sex). The variable Colour has two levels, 'D' and 'W', and its blank entries should be explicitly recognised as missing data.
We want to redefine the variables Population, Sex and Colour so that R recognises them as factors (unordered factors). We will also redefine the variable Individual to be plain text (i.e. a character) to demonstrate the as.character() function.
# Convert Population variable from numeric to a factor (a qualitative variable)
wolf$Population = as.factor(wolf$Population)

# Convert Sex variable from character to a factor (a qualitative variable)
wolf$Sex = as.factor(wolf$Sex)

# Convert Colour variable from character to a factor (a qualitative variable)
wolf$Colour = as.factor(wolf$Colour)

# Convert Individual variable from numeric to plain text
wolf$Individual = as.character(wolf$Individual)

# Display an overview of the data frame
summary(wolf)
##   Individual        Sex    Population Colour      Cpgmg           Tpgmg
##  Length:178         F:72   1: 45       : 30   Min.   : 4.75   Min.   : 3.140
##  Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.372
##  Mode  :character   U:30   3: 30      W:111   Median :15.61   Median : 5.070
##                                               Mean   :17.74   Mean   : 6.148
##                                               3rd Qu.:20.35   3rd Qu.: 6.317
##                                               Max.   :73.19   Max.   :61.790
##
##      Ppgmg
##  Min.   :12.76
##  1st Qu.:19.50
##  Median :25.00
##  Mean   :25.89
##  3rd Qu.:30.01
##  Max.   :53.28
##  NA's   :109
Notice how the summary of the variables Population, Sex, Colour and Individual has changed now that their data types have been changed (Population, Sex and Colour are now factors and Individual is now plain text). Also note that missing values, NA's, are explicitly taken into account when summarizing the data (e.g. the variable Ppgmg).
There is a set of related functions for coercing variables into other data types. Here are some examples
as.factor(...)     # Coerces a variable to be a factor (qualitative, nominal)
as.numeric(...)    # Coerces a variable to be numeric (quantitative, continuous)
as.character(...)  # Coerces a variable to be a character (plain text)
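One point worth illustrating when coercing data types is that as.numeric() applied to a factor returns the factor's internal level codes rather than the numbers written in its labels. The sketch below uses a hypothetical factor called dose to show this, along with the usual workaround of converting to character first.

```r
# A factor whose labels happen to look like numbers (hypothetical example)
dose = factor(c('10', '20', '10'))

# Direct coercion returns the internal level codes (1 for '10', 2 for '20')
codes = as.numeric(dose)

# Converting to character first recovers the numbers written in the labels
values = as.numeric(as.character(dose))

print(codes)
print(values)
```

The same care is needed if a variable such as Population is ever converted back from a factor to a number.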
Removing a variable
Sometimes we want to remove a variable from a data frame.
The insect data frame has two variables that should not be part of the data set (X and X.1). This is quite common when importing data. In this case the reason is two additional TABs at the end of each line in the text file. These TABs are hard to see, but R recognized them, created two additional variables and named them with default labels.
The columns can be removed by first finding out how many rows and columns the data frame has and then removing the two unwanted columns (columns 7 and 8). Here is the code
ncol(insect) # Number of columns in data frame
nrow(insect) # Number of rows in data frame
dim(insect) # Display number of rows and columns
insect = insect[ ,- c(7,8)] # Remove columns 7 and 8 (X and X.1)
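An alternative, sketched below, is to remove columns by name rather than by position, which keeps working if the column order changes. The data frame insect.demo and the name insect.clean are hypothetical and used only for illustration.

```r
# A small made-up data frame with two unwanted columns (values are illustrative only)
insect.demo = data.frame(Spray.A = c(10, 7),
                         Spray.B = c(11, 17),
                         X = c(NA, NA),
                         X.1 = c(NA, NA))

# Logical index that is TRUE for the columns we want to keep
unwanted = c('X', 'X.1')
keep = !(names(insect.demo) %in% unwanted)

# Keep only the wanted columns
insect.clean = insect.demo[ , keep]
print(names(insect.clean))

# A single column can also be removed by setting it to NULL
insect.demo$X = NULL
print(names(insect.demo))
```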
Set missing data to NA
Always use NA to represent missing data
Data on coat colour is missing for population 3. R explicitly represents missing data as NA, but the WOLF.CSV data file uses a blank space to represent missing data.
The code below sets these blank spaces to NA
# Create a logical variable that is TRUE if an observation is from population 3
bool.index = wolf$Population== 3

# Set coat colour variable to be NA for observations from population 3
wolf$Colour[bool.index] = NA
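As a sketch of a related approach, blank entries can be located directly (rather than through the population label) and set to NA before the variable is converted to a factor, so that the blank never becomes a level. The data frame wolf.demo below is a hypothetical stand-in with illustrative values.

```r
# A small made-up data frame where missing colours are stored as blanks
wolf.demo = data.frame(Population = c(1, 2, 3, 3),
                       Colour = c('D', 'W', '', ''),
                       stringsAsFactors = FALSE)

# Create a logical variable that is TRUE where the colour is blank
blank.index = wolf.demo$Colour == ''

# Set the blank entries to NA, then convert the variable to a factor
wolf.demo$Colour[blank.index] = NA
wolf.demo$Colour = as.factor(wolf.demo$Colour)

n.missing = sum(is.na(wolf.demo$Colour))   # Count the missing values
print(n.missing)
print(levels(wolf.demo$Colour))            # The blank is not one of the levels
```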
Subset of a data frame
Selecting observations (rows) from a data frame
To select only particular rows from a data frame using a criterion you can use the subset function.
For example, to make a subset of the data in wolf that contains only females,
wolf.F = subset(wolf, Sex== 'F') # Create a subset with data on female wolves
Another way to subset the data frame is to use a logical index:
# Create a logical variable which is TRUE if an observation is for a female
bool.index = wolf$Sex== 'F'

# Create a subset containing only data on female wolves
wolf.F2 = wolf[bool.index, ]
Make a subset using several variables
# Create a subset containing only data on female wolves in Population 1
# method 1:
wolf.F3 = subset(wolf, Sex== 'F' & Population== 1)

# Create a subset containing only data on female wolves in Population 1
# method 2:
bool.index = wolf$Sex== 'F' & wolf$Population== 1
wolf.F4 = wolf[bool.index,]
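The two methods give the same result when the variables used in the criterion have no missing values, but they differ when NA is present. The sketch below uses a hypothetical data frame wolf.demo to show the difference: subset() drops rows where the criterion is NA, whereas a logical index containing NA produces a row filled with NA. Wrapping the index in which() is one way to avoid this.

```r
# A small made-up data frame with one missing value for Sex
wolf.demo = data.frame(Individual = 1:4,
                       Sex = c('F', 'M', NA, 'F'),
                       stringsAsFactors = FALSE)

# method 1: subset() ignores rows where the criterion is NA
by.subset = subset(wolf.demo, Sex == 'F')

# method 2: a logical index keeps a row of NAs where the criterion is NA
bool.index = wolf.demo$Sex == 'F'
by.index = wolf.demo[bool.index, ]

# method 2 with which(): only the positions that are TRUE are kept
by.which = wolf.demo[which(bool.index), ]

print(nrow(by.subset))
print(nrow(by.index))
print(nrow(by.which))
```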
Another example using a logical OR (|)
# Create a subset containing only data on wolves in Population 1 OR Population 2
wolf.F5 = subset(wolf, Population== 1 | Population== 2)
summary(wolf.F5)
##   Individual        Sex    Population Colour      Cpgmg           Tpgmg
##  Length:148         F:72   1: 45       :  0   Min.   : 4.75   Min.   : 3.250
##  Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.378
##  Mode  :character   U: 0   3:  0      W:111   Median :15.38   Median : 5.030
##                                               Mean   :16.61   Mean   : 5.617
##                                               3rd Qu.:19.98   3rd Qu.: 6.067
##                                               Max.   :40.43   Max.   :15.130
##
##      Ppgmg
##  Min.   :12.76
##  1st Qu.:19.50
##  Median :25.00
##  Mean   :25.89
##  3rd Qu.:30.01
##  Max.   :53.28
##  NA's   :79
Dropping empty levels of a factor
The subset wolf.F5 contains no data from population 3, but the factor Population still has 3 levels. To remove unwanted levels from a factor use the function droplevels()
Using the droplevels() function on the data frame wolf.F5 will remove the level for population 3, as well as any other levels that contain no data (e.g. wolves with an undetermined sex, level U of variable Sex)
wolf.F5 = droplevels(wolf.F5) # Update the levels of factors in wolf.F5
summary(wolf.F5)              # The factor Population now has 2 levels
##   Individual        Sex    Population Colour      Cpgmg           Tpgmg
##  Length:148         F:72   1: 45      D: 37   Min.   : 4.75   Min.   : 3.250
##  Class :character   M:76   2:103      W:111   1st Qu.:12.16   1st Qu.: 4.378
##  Mode  :character                             Median :15.38   Median : 5.030
##                                               Mean   :16.61   Mean   : 5.617
##                                               3rd Qu.:19.98   3rd Qu.: 6.067
##                                               Max.   :40.43   Max.   :15.130
##
##      Ppgmg
##  Min.   :12.76
##  1st Qu.:19.50
##  Median :25.00
##  Mean   :25.89
##  3rd Qu.:30.01
##  Max.   :53.28
##  NA's   :79
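The effect of droplevels() can also be seen on a single factor. The sketch below uses a hypothetical factor called pop and counts its levels with nlevels() before and after the empty level is dropped.

```r
# A factor with three levels (hypothetical example)
pop = factor(c(1, 2, 3, 1, 2))

# Keep only the observations that are not from population 3
pop.12 = pop[pop != '3']
n.before = nlevels(pop.12)    # Level '3' is still present, although it contains no data

# Remove levels that contain no data
pop.12 = droplevels(pop.12)
n.after = nlevels(pop.12)

print(n.before)
print(n.after)
print(levels(pop.12))
```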
Selecting variables (columns) from a data frame
The subset command can be used to extract one or more variables from a data frame. For example, to select only the cortisol (Cpgmg) and Population variables from the wolf data frame (these are the third and fifth columns in the data frame)
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset1 = subset(wolf, select= c('Population','Cpgmg'))
Other ways to select variables from a data frame
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset2 = wolf[,c('Population','Cpgmg')]

# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
# (columns 3 and 5 in the wolf data frame)
wolf.subset3 = wolf[,c(3,5)]

# Create a subset of the data containing the variable 'Population'
# using the variable name
wolf$Population
Variables (columns) and observations (rows) can be selected at the same time. Here is an example selecting data on population identity and cortisol for just female wolves
# Create a subset of the data containing only female wolves and the
# variables 'Population' and 'Cpgmg'
wolf.subset4 = subset(wolf, Sex== 'F', select= c('Population','Cpgmg'))
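One detail to be aware of when selecting a single variable with square brackets is that R returns a plain vector rather than a data frame, unless drop= FALSE is given. The sketch below shows this with a hypothetical data frame wolf.demo; the names pop.vector, pop.frame and pop.subset are used only for illustration.

```r
# A small made-up data frame (values are illustrative only)
wolf.demo = data.frame(Population = c(1, 2, 1),
                       Cpgmg = c(15.86, 20.02, 9.95))

# Selecting one column with [ ] gives a vector
pop.vector = wolf.demo[ , 'Population']

# Adding drop= FALSE keeps the result as a data frame with one column
pop.frame = wolf.demo[ , 'Population', drop = FALSE]

# subset() with select always returns a data frame
pop.subset = subset(wolf.demo, select = 'Population')

print(class(pop.vector))
print(class(pop.frame))
print(class(pop.subset))
```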
Saving data
Large data sets can be time consuming to import into R. Once a file has been imported it is a good idea to save the data in R's native binary format. Data in this format is quick to import and takes up less space on the hard drive. By convention, files containing data in R's binary format have the suffix .Rdata.
To save the variables wolf and insect to a file use the save() command
# Save wolf and insect to a file called 'sheet2_data.Rdata'
save(wolf, insect, file= 'sheet2_data.Rdata')
We can verify that the data have been correctly saved by clearing R's memory and re-importing them using the load() command. Try running the following commands to see if you can reload the data saved in file sheet2_data.Rdata.
rm(list= ls())                     # Clear variables from memory
ls()                               # Display the variables in R's memory
load(file= 'sheet2_data.Rdata')    # Import R binary data from a file
ls()                               # Display the variables in R's memory
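The save and reload cycle can be tried out safely with a temporary file, as in the sketch below. The data frame wolf.demo and the names wolf.copy and demo.file are hypothetical; the file is created with tempfile() so nothing in the working directory is overwritten.

```r
# A small made-up data frame (values are illustrative only)
wolf.demo = data.frame(Sex = c('M', 'F'),
                       Cpgmg = c(15.86, 20.02))
wolf.copy = wolf.demo                        # Keep a copy for comparison

# Save the data frame to a temporary file in R's binary format
demo.file = tempfile(fileext = '.Rdata')
save(wolf.demo, file = demo.file)

rm(wolf.demo)                                # Remove just this variable from memory

# load() restores the variable and returns the names of what it loaded
loaded.names = load(file = demo.file)
print(loaded.names)

is.same = identical(wolf.demo, wolf.copy)    # Check that the reloaded data match the original
print(is.same)

unlink(demo.file)                            # Delete the temporary file
```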
Summary of the topics covered
- Displaying contents of a data frame
- Manipulating data in a data frame
- Creating subsets of data
- Saving a data frame to a file using R's binary data file format
- Reading data from an R binary data file
Further Reading
All these books can be found in UCD's library
- Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapter 3]
- Michael J. Crawley, 2015 Statistics: an introduction using R (John Wiley & Sons, Chichester) [Chapter 2]
- Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)
Source: https://www.ucd.ie/ecomodel/Resources/Sheet2b_dataframe_in_R_WebVersion.html
Posted by: martineznevard.blogspot.com