readxl | Hmisc |
---|---|
1.3.1 | 4.4.2 |
Data files can be loaded from the R session’s working directory, from a directory structure relative to the working directory using the single dot .
or double dot ..
syntax, or (for some file types) directly from a website. The following sections will expose you to a mixture of data file environments. For a refresher on directory structures, review Understanding directory structures.
A popular data file format (and one that has withstood the test of time) is the text file format where columns are separated by a tab, space or comma. In the following example, R reads a comma delimited file called ACS.csv into a data object called dat
.
<- read.csv("ACS.csv", header=TRUE) dat
If the CSV file resides on a website, you can load the file directly from that site as follows:
<- read.csv("http://mgimond.github.io/ES218/Data/ACS.csv", header=TRUE) dat
Note that not all data file formats can be readily loaded directly from a website in a “read” function without additional lines of code. Examples are given in the next two sub-sections.
To read other text formats that use different delimiters invoke the command read.table()
and define the type of delimiter using the sep=
parameter. For example, to read a tab delimited data file called ACS.txt, run the command read.table("ACS.txt", sep="\t")
.
Note that if a number or a string is identified as being a placeholder for missing values in the data file, you can use the na.strings =
parameter in the read.csv
function. For example, assume that the word "missing"
was used in the csv file to denote a missing value, the function would be modified as follows:
<- read.csv("http://mgimond.github.io/ES218/Data/ACS.csv",
dat na.strings = "missing")
If more than one value is used as a placeholder for a missing value, you will need to combine the values using the c()
operator. For example, if in addition to the word "missing"
the value of -9999
was used to designate missing values, you would modify the above chunk of code as follows:
<- read.csv("http://mgimond.github.io/ES218/Data/ACS.csv",
dat na.strings = c("missing", "-9999") )
Note how the number is wrapped in double quotes. Also, note that the na.strings
parameter is applied to all columns in the dataframe. So if the word "missing"
or the number -9999
are valid values for some of the columns, you should not use this option. Instead, you would need to selectively replace the missing values after the dataset is loaded. This option will be explored in subsequent tutorials.
R has its own data file format–it’s usually saved using the .rds extension. To read a R data file, invoke the readRDS()
function.
<- readRDS("ACS.rds") dat
As with a CSV file, you can load a RDS file straight from a website, however, you must first run the file through a decompressor before attempting to load it via readRDS
. A built-in decompressor function called gzcon
can be used for this purpose.
<- readRDS(gzcon(url("http://mgimond.github.io/ES218/Data/ACS.rds"))) dat
The .rds file format is usually smaller than its text file counterpart and will therefore take up less storage space. The .rds file will also preserve data types and classes such as factors and dates eliminating the need to redefine data types after loading the file.
A package that does a good job in importing Excel files is readxl
. It recognizes most column formats defined by Excel including date formats. However, only one sheet can be loaded at a time. So if multiple Excel sheets are to be worked on, each sheet will need to be loaded into separate dataframe objects.
If you don’t have the readxl
package installed, install the package as you would any other package via RStudio’s interface or in R using the following command:
install.packages("readxl")
In this example, we will load an Excel data sheet called Discharge
which tabulates daily river water discharge. The sample file, Discharge_2004_2014.xlsx
, can be downloaded here. The following chunk of code assumes that the Excel file is saved in a folder called Data/
inside the R session’s working directory.
library(readxl)
<- read_excel("./Data/Discharge_2004_2014.xlsx", sheet = "Discharge") xl
Note that the single dot .
that precedes the Data/
folder name instructs R to look for the Data/
folder in the current working directory. A later example will show you how to instruct R to look for files and folders outside of the current working directory using two dots ..
instead of one.
An advantage to using this package for loading Excel files is its ability to preserve data types–including date formatted columns! In the above example, the Excel file has a column called Date
which stores the month/day/year data as a date object. We can check that the loaded xl
object recognizes the Date
column as a date
data type:
str(xl)
tibble [3,866 x 3] (S3: tbl_df/tbl/data.frame)
$ Date : POSIXct[1:3866], format: "2004-06-01" "2004-06-02" "2004-06-03" "2004-06-04" ...
$ Discharge: num [1:3866] 6170 6590 6210 7120 6990 6160 5570 4500 4940 4550 ...
$ Code : chr [1:3866] "A" "A" "A" "A" ...
The Date
column is defined as a POSIXct
data type; this is the computer’s way of storing dates as the number of seconds since some internal reference date. We would therefore not need to convert the date column as would be the case if the date column was loaded from a CSV file. If such was the case, then the date column would most likely be loaded as a character or factor data type. A more in-depth discussion on date objects and their manipulation in R is covered in the next chapter.
Excel files can be loaded directly from the web using the following chunk of code:
<- "http://mgimond.github.io/ES218/Data/Discharge_2004_2014.xlsx"
web.file <- tempfile(fileext=".xlsx")
tmp download.file(web.file,destfile=tmp, mode="wb")
<- read_excel(tmp, sheet = "Discharge") xl
Instead of downloading the file into virtual memory, R needs to download the file into a temporary folder before it can open it. However, that temporary file my not be available in a later session, so you will probably need to reload the data if you launch a new R session.
It’s usually recommended that a data file be stored as a CSV or tab delimited file format if compatibility across software platforms is desired. However, you might find yourself in a situation where you have no option but to import data stored in a proprietary format. This requires the use (and installation) of a package called Hmisc
. The package will convert the following file formats: SAS (XPT format), SPSS (SAV format) and Stata (dta format). You can install the package on your computer as follows:
install.packages("Hmisc")
In this example, a SAS file of blood pressure from the CDC will be loaded into an object called dat
(file documentation can be found here).
library(Hmisc)
<- sasxport.get("http://personal.colby.edu/personal/m/mgimond/R/Data/BPX_G.xpt") dat
Likewise, to import an SPSS file, use the spss.get()
function; and to import a STATA file, use the stata.get()
function.
To export a data object called dat.sub
as a comma delimited file in a folder called Data/
residing at a level above the R session’s working directory, run the following:
write.csv(dat.sub, "../Data/ACS_sub.csv")
Note that the two dots ..
tells the function to get out of the current folder (i.e. moving up a directory) before proceeding down into the Data/
folder.
To export a data object called dat.sub
to a .Rds (R) file format in a folder called Data/
residing at a level above the R session’s working directory, run the following:
saveRDS(dat.sub, "../Data/ACS_sub.rds")
You can save an entire R session (which includes all data objects) using the save
function.
To save all objects, set the list=
parameter to ls()
:
save(list=ls(), file = "../Data/ACS_all.Rdata")
To save only two R session objects–dat
and dat.sub
–to a file, pass the list of objects to the list=
parameter:
save(list=c(dat, dat.sub), file = "../Data/ACS_subset.Rdata")
To load a previously saved R session type:
load("../Data/ACS_all.Rdata")