The dataset used in this tutorial was downloaded from NOAA's NDBC buoy data center in a space-delimited file format. It consists of hourly air and ocean physical measurements, such as temperature, wind speed and wave height, for 2012 measured at NOAA's buoy #44005 located in the Gulf of Maine (43.204 N, 69.128 W).
Type the following to load the data file into your current R session (note that we are using the `read.table` function instead of the `read.csv` function since the data are not comma delimited):
dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat", header=TRUE)
Since `read.table` defaults to a space-separated format (the same format our data file is stored in), there is no need to specify the `sep=` parameter.
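Had we wanted to be explicit, we could have spelled the separator out; a minimal sketch (`sep = ""` is `read.table`'s default and matches any run of whitespace):

```r
# Equivalent explicit call: sep = "" matches any whitespace (the default)
dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat",
                  header = TRUE, sep = "")
```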
First, let’s extract the number of rows and columns in our table:
dim(dat)
[1] 8347 18
`dat` consists of 8347 rows and 18 columns.
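If you need just one of the two dimensions, base R also provides the `nrow` and `ncol` functions:

```r
nrow(dat)  # number of rows: 8347
ncol(dat)  # number of columns: 18
```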
We can extract the names of each column using the `names` function:
names(dat)
[1] "YY" "MM" "DD" "hh" "mm" "WDIR" "WSPD" "GST" "WVHT" "DPD"
[11] "APD" "MWD" "PRES" "ATMP" "WTMP" "DEWP" "VIS" "TIDE"
A brief description of each column follows:
Field | Description |
---|---|
YY | Year |
MM | Month |
DD | Day |
hh | Hour |
mm | Minute |
WDIR | Wind direction (the direction the wind is coming from in degrees clockwise from true N) |
WSPD | Wind speed, averaged over an eight-minute period (m/s) |
GST | Peak 5 or 8 second gust speed measured during the eight-minute or two-minute period (m/s) |
WVHT | Significant wave height (meters) |
DPD | Dominant wave period (seconds) |
APD | Average wave period (seconds) |
MWD | The direction from which the waves at the dominant period (DPD) are coming. The units are degrees from true North, increasing clockwise, with North as 0 (zero) degrees and East as 90 degrees. |
PRES | Sea level pressure (hPa) |
ATMP | Air temperature (Celsius) |
WTMP | Sea surface temperature (Celsius) |
DEWP | Dewpoint temperature (Celsius) |
VIS | Visibility (nautical miles) |
TIDE | Water level above or below Mean Lower Low Water, MLLW (feet) |
For large datasets, it's sometimes helpful to display just the first few lines of the table to get a sense of the kind of data we are dealing with. We'll use the `head` function to do this and have R return the first five records of the table (`n=5`).
head(dat, n = 5)
YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS
1 2011 12 31 23 50 316 4.8 6.0 0.92 5.56 4.88 999 1008.1 6.8 8.1 6.8 99
2 2012 1 1 0 50 296 5.5 6.3 0.89 5.56 4.60 999 1007.8 6.1 8.1 6.1 99
3 2012 1 1 1 50 311 8.4 10.0 0.87 8.33 4.09 999 1008.2 5.5 8.1 5.5 99
4 2012 1 1 2 50 314 8.3 10.0 0.93 3.45 4.05 999 1008.0 5.4 8.1 5.4 99
5 2012 1 1 3 50 313 11.1 13.0 1.12 3.85 3.88 999 1008.0 5.8 8.1 5.8 99
TIDE
1 99
2 99
3 99
4 99
5 99
We can also display the last few lines of the table using the `tail` function.
tail(dat, n = 5)
YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP
8343 2012 12 14 1 50 999 99 99 0.31 9.09 5.15 999 1026.3 5.6 9.7 -4.1
8344 2012 12 14 2 50 999 99 99 0.33 9.09 4.69 999 1025.5 5.6 9.7 -4.2
8345 2012 12 14 3 50 999 99 99 0.36 12.90 4.04 999 1025.0 5.6 9.7 -4.2
8346 2012 12 14 4 50 999 99 99 0.36 12.90 3.87 999 1024.6 5.6 9.7 -4.0
8347 2012 12 14 5 50 999 99 99 0.37 8.33 3.66 999 1023.8 5.6 9.6 -3.9
VIS TIDE
8343 99 99
8344 99 99
8345 99 99
8346 99 99
8347 99 99
To view the data type/class associated with each variable in the table, we'll use the `str` function.
str(dat)
'data.frame': 8347 obs. of 18 variables:
$ YY : int 2011 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ MM : int 12 1 1 1 1 1 1 1 1 1 ...
$ DD : int 31 1 1 1 1 1 1 1 1 1 ...
$ hh : int 23 0 1 2 3 4 5 6 7 8 ...
$ mm : int 50 50 50 50 50 50 50 50 50 50 ...
$ WDIR: int 316 296 311 314 313 319 319 318 332 322 ...
$ WSPD: num 4.8 5.5 8.4 8.3 11.1 11.1 11.1 10.5 9.8 8.2 ...
$ GST : num 6 6.3 10 10 13 13.3 13.2 12.5 11.5 9.6 ...
$ WVHT: num 0.92 0.89 0.87 0.93 1.12 1.25 1.48 1.46 1.41 1.23 ...
$ DPD : num 5.56 5.56 8.33 3.45 3.85 4.76 4.55 5 5 5 ...
$ APD : num 4.88 4.6 4.09 4.05 3.88 4.06 4.31 4.39 4.31 4.15 ...
$ MWD : int 999 999 999 999 999 999 999 999 999 999 ...
$ PRES: num 1008 1008 1008 1008 1008 ...
$ ATMP: num 6.8 6.1 5.5 5.4 5.8 5.8 6 6.5 6.8 6.7 ...
$ WTMP: num 8.1 8.1 8.1 8.1 8.1 8 8 8.1 8.1 8.1 ...
$ DEWP: num 6.8 6.1 5.5 5.4 5.8 5.7 5.8 5.5 5.1 5.2 ...
$ VIS : num 99 99 99 99 99 99 99 99 99 99 ...
$ TIDE: num 99 99 99 99 99 99 99 99 99 99 ...
All values are stored as numbers, with about a third of them stored as integers (`int`) and the rest as double precision (`num`).
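If you would rather tally the classes than eyeball the `str` output, a quick base R sketch combines `sapply` and `table`:

```r
# Count columns by class: 7 integer and 11 numeric columns per the str output
table(sapply(dat, class))
```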
Now let’s generate a descriptive summary of the table.
summary(dat)
YY MM DD hh mm
Min. :2011 Min. : 1.000 Min. : 1.0 Min. : 0.00 Min. :50
1st Qu.:2012 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.: 5.00 1st Qu.:50
Median :2012 Median : 6.000 Median :15.0 Median :11.00 Median :50
Mean :2012 Mean : 6.234 Mean :15.4 Mean :11.49 Mean :50
3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:17.50 3rd Qu.:50
Max. :2012 Max. :12.000 Max. :31.0 Max. :23.00 Max. :50
WDIR WSPD GST WVHT
Min. : 1.0 Min. : 0.0 Min. : 0.00 Min. : 0.260
1st Qu.:143.0 1st Qu.: 3.9 1st Qu.: 4.80 1st Qu.: 0.740
Median :213.0 Median : 6.2 Median : 7.40 Median : 1.070
Mean :234.3 Mean :10.5 Mean :11.81 Mean : 1.624
3rd Qu.:285.0 3rd Qu.: 8.9 3rd Qu.:10.70 3rd Qu.: 1.580
Max. :999.0 Max. :99.0 Max. :99.00 Max. :99.000
DPD APD MWD PRES
Min. : 2.600 Min. : 3.120 Min. :999 Min. : 981.3
1st Qu.: 5.560 1st Qu.: 4.410 1st Qu.:999 1st Qu.:1009.2
Median : 7.140 Median : 5.000 Median :999 Median :1014.9
Mean : 7.982 Mean : 5.551 Mean :999 Mean :1017.9
3rd Qu.: 9.090 3rd Qu.: 5.830 3rd Qu.:999 3rd Qu.:1020.5
Max. :99.000 Max. :99.000 Max. :999 Max. :9999.0
ATMP WTMP DEWP VIS TIDE
Min. :-11.00 Min. : 5.40 Min. :-16.000 Min. :99 Min. :99
1st Qu.: 5.60 1st Qu.: 7.10 1st Qu.: 0.900 1st Qu.:99 1st Qu.:99
Median : 10.40 Median :11.60 Median : 8.600 Median :99 Median :99
Mean : 10.78 Mean :12.17 Mean : 8.268 Mean :99 Mean :99
3rd Qu.: 16.80 3rd Qu.:16.30 3rd Qu.: 15.300 3rd Qu.:99 3rd Qu.:99
Max. : 25.10 Max. :24.30 Max. :999.000 Max. :99 Max. :99
You'll note that many columns in our data contain values of 99 or 999. Such values are often placeholders for missing data. If not dealt with properly, these values may be interpreted as valid data in subsequent analyses. One tell-tale sign that these values should be treated as placeholders is how extreme they are compared to the rest of the data batch. For example, the wave height field, `WVHT`, has values that average less than 2 meters, so a value of 99 meters should be flagged as suspicious. In fact, NOAA's documentation states that “Missing data … are denoted by … a variable number of 9’s … depending on the data type (for example: 999.0 99.0).”
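A quick way to gauge how pervasive a placeholder is in a given column is to count its occurrences; for example (a sketch using base R):

```r
# Count the number of WVHT records carrying the 99 placeholder
sum(dat$WVHT == 99)
```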
We also note that some columns contain nothing but missing values (e.g. `VIS` and `TIDE`), suggesting that either these variables are not measured at this particular buoy or the instruments did not function properly during the data collection period.
Before proceeding with any analysis, we need to address the missing values by flagging them as such or removing them altogether.
Let's first remove all columns devoid of valid data. These columns are `MWD`, `VIS` and `TIDE`. There are many ways to do this. We will opt for subsetting the table and assigning this subset to a new data frame we will call `dat1`.
dat1 <- dat[ , !(names(dat) %in% c("MWD", "VIS", "TIDE"))]
Let's break down this expression into its components. `names(dat)` returns a list of column names in the order in which they appear in the object `dat`: YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE. We use the match operator, `%in%`, to evaluate whether each column name is included in the set of elements defined in `c("MWD", "VIS", "TIDE")`. The expression can be read as “… are any of the column names (i.e. YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE) included in the set `("MWD", "VIS", "TIDE")`?” The output of this comparison is a series of `TRUE`s and `FALSE`s depending on whether or not a match is found.
names(dat) %in% c("MWD", "VIS", "TIDE")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[13] FALSE FALSE FALSE FALSE TRUE TRUE
The order in which the `TRUE`s and `FALSE`s are listed matches the order of the input column names. This order is important because this index is what R uses to identify which columns to return. But wait: R will return the columns associated with a `TRUE` value! We want the reverse, so we add the NOT operator, `!`, in front of the expression, thus flipping the `TRUE`/`FALSE` values.
!(names(dat) %in% c("MWD", "VIS", "TIDE"))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE FALSE FALSE
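If the mechanics are still unclear, here is a toy example (unrelated to the buoy data) showing `%in%` and `!` on a small vector:

```r
x <- c("YY", "MM", "VIS")
x %in% c("MWD", "VIS", "TIDE")     # FALSE FALSE  TRUE
!(x %in% c("MWD", "VIS", "TIDE"))  #  TRUE  TRUE FALSE
```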
So `dat1` contains all the columns except `MWD`, `VIS` and `TIDE`.
head(dat1)
YY MM DD hh mm WDIR WSPD GST WVHT DPD APD PRES ATMP WTMP DEWP
1 2011 12 31 23 50 316 4.8 6.0 0.92 5.56 4.88 1008.1 6.8 8.1 6.8
2 2012 1 1 0 50 296 5.5 6.3 0.89 5.56 4.60 1007.8 6.1 8.1 6.1
3 2012 1 1 1 50 311 8.4 10.0 0.87 8.33 4.09 1008.2 5.5 8.1 5.5
4 2012 1 1 2 50 314 8.3 10.0 0.93 3.45 4.05 1008.0 5.4 8.1 5.4
5 2012 1 1 3 50 313 11.1 13.0 1.12 3.85 3.88 1008.0 5.8 8.1 5.8
6 2012 1 1 4 50 319 11.1 13.3 1.25 4.76 4.06 1008.8 5.8 8.0 5.7
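For reference, there are other ways to drop these columns in base R; here are two sketches that yield the same subset:

```r
# Alternative 1: subset() with a negative select
dat1 <- subset(dat, select = -c(MWD, VIS, TIDE))

# Alternative 2: keep only the names not in the drop set, via setdiff()
dat1 <- dat[ , setdiff(names(dat), c("MWD", "VIS", "TIDE"))]
```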
## Assigning `NA` to missing data

It's usually best to assign `NA` values the moment a data file is loaded into R. For example, had all missing values been flagged as `-9999`, we could have added the parameter `na.strings="-9999"` to the `read.table()` function. Our dataset, however, contains missing values flagged as 99, 999 and 9999, which would require combining the missing value strings as in `na.strings = c("99", "999", "9999")`; this would have resulted in undesirable outcomes. For example, a wave height of 99 meters clearly does not make sense, so flagging such a value as `NA` would be sensible; but a wind direction of 99 degrees does make sense, and flagging such a value as `NA` would not be appropriate. Had we set all values of 99 to `NA` the moment we loaded the file into R, we would have inadvertently removed valid values from the dataset. So flagging missing values as `NA` in our dataset will require column-level intervention.
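For completeness, here is what the load-time approach would have looked like under the single-flag scenario described above (illustration only; as just explained, it is not appropriate for this dataset):

```r
# Hypothetical: if every missing value had been flagged as -9999,
# the conversion could have happened at load time
dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat",
                  header = TRUE, na.strings = "-9999")
```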
Let's review the data summary to identify each column's no-data numeric designation (i.e. 99, 999 or 9999), if present. Remember that we are now working off of the `dat1` dataframe and not the original `dat` dataframe.
summary(dat1)
YY MM DD hh mm
Min. :2011 Min. : 1.000 Min. : 1.0 Min. : 0.00 Min. :50
1st Qu.:2012 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.: 5.00 1st Qu.:50
Median :2012 Median : 6.000 Median :15.0 Median :11.00 Median :50
Mean :2012 Mean : 6.234 Mean :15.4 Mean :11.49 Mean :50
3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:17.50 3rd Qu.:50
Max. :2012 Max. :12.000 Max. :31.0 Max. :23.00 Max. :50
WDIR WSPD GST WVHT
Min. : 1.0 Min. : 0.0 Min. : 0.00 Min. : 0.260
1st Qu.:143.0 1st Qu.: 3.9 1st Qu.: 4.80 1st Qu.: 0.740
Median :213.0 Median : 6.2 Median : 7.40 Median : 1.070
Mean :234.3 Mean :10.5 Mean :11.81 Mean : 1.624
3rd Qu.:285.0 3rd Qu.: 8.9 3rd Qu.:10.70 3rd Qu.: 1.580
Max. :999.0 Max. :99.0 Max. :99.00 Max. :99.000
DPD APD PRES ATMP
Min. : 2.600 Min. : 3.120 Min. : 981.3 Min. :-11.00
1st Qu.: 5.560 1st Qu.: 4.410 1st Qu.:1009.2 1st Qu.: 5.60
Median : 7.140 Median : 5.000 Median :1014.9 Median : 10.40
Mean : 7.982 Mean : 5.551 Mean :1017.9 Mean : 10.78
3rd Qu.: 9.090 3rd Qu.: 5.830 3rd Qu.:1020.5 3rd Qu.: 16.80
Max. :99.000 Max. :99.000 Max. :9999.0 Max. : 25.10
WTMP DEWP
Min. : 5.40 Min. :-16.000
1st Qu.: 7.10 1st Qu.: 0.900
Median :11.60 Median : 8.600
Mean :12.17 Mean : 8.268
3rd Qu.:16.30 3rd Qu.: 15.300
Max. :24.30 Max. :999.000
It appears that the fields with missing values flagged as `99` are `WSPD`, `GST`, `WVHT`, `DPD` and `APD`; the fields with missing values flagged as `999` are `WDIR` and `DEWP`; and the field with missing values flagged as `9999` is `PRES`. We will convert these values to missing values (`NA`) in the following chunk of code.
# Create lists of columns sharing a missing value designation
na99   <- c("WSPD", "GST", "WVHT", "DPD", "APD")
na999  <- c("WDIR", "DEWP")
na9999 <- c("PRES")

# Replace missing values with NA
dat1[ , na99][dat1[ , na99] == 99]       <- NA
dat1[ , na999][dat1[ , na999] == 999]    <- NA
dat1[ , na9999][dat1[ , na9999] == 9999] <- NA
The above expression is quite succinct, but succinctness does not always translate into readability. There are two components to each of the last three lines of code: the first is a subset of the columns for which a particular missing value is designated (e.g. `dat1[ , na99]`); the second is an evaluation statement identifying which values in that column subset are flagged as missing (e.g. `[dat1[ , na99] == 99]`). Together, these elements define the indices (table cells) whose values are replaced with `NA` (i.e. `<- NA`).
Another way this could have been accomplished (and one you might feel more comfortable with) would have been to work off of one column at a time, for example:
dat1$WSPD[dat1$WSPD == 99] <- NA
dat1$GST[dat1$GST == 99] <- NA
dat1$WVHT[dat1$WVHT == 99] <- NA
dat1$DPD[dat1$DPD == 99] <- NA
dat1$APD[dat1$APD == 99] <- NA

dat1$WDIR[dat1$WDIR == 999] <- NA
dat1$DEWP[dat1$DEWP == 999] <- NA

dat1$PRES[dat1$PRES == 9999] <- NA
This chunk of code is easier to read, but lacks the succinctness of its predecessor.
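A middle ground between the two approaches loops over each group of columns; a sketch using base R's `lapply` and `replace`:

```r
# Replace each group's placeholder with NA, one group of columns at a time
dat1[na99]   <- lapply(dat1[na99],   function(x) replace(x, x == 99,   NA))
dat1[na999]  <- lapply(dat1[na999],  function(x) replace(x, x == 999,  NA))
dat1[na9999] <- lapply(dat1[na9999], function(x) replace(x, x == 9999, NA))
```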
Let’s look at a summary of our table:
summary(dat1)
YY MM DD hh mm
Min. :2011 Min. : 1.000 Min. : 1.0 Min. : 0.00 Min. :50
1st Qu.:2012 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.: 5.00 1st Qu.:50
Median :2012 Median : 6.000 Median :15.0 Median :11.00 Median :50
Mean :2012 Mean : 6.234 Mean :15.4 Mean :11.49 Mean :50
3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:17.50 3rd Qu.:50
Max. :2012 Max. :12.000 Max. :31.0 Max. :23.00 Max. :50
WDIR WSPD GST WVHT
Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. :0.260
1st Qu.:136.0 1st Qu.: 3.800 1st Qu.: 4.700 1st Qu.:0.740
Median :208.0 Median : 6.000 Median : 7.200 Median :1.060
Mean :198.2 Mean : 6.315 Mean : 7.684 Mean :1.285
3rd Qu.:273.0 3rd Qu.: 8.500 3rd Qu.:10.200 3rd Qu.:1.570
Max. :360.0 Max. :18.800 Max. :23.300 Max. :8.650
NA's :377 NA's :377 NA's :377 NA's :29
DPD APD PRES ATMP
Min. : 2.600 Min. : 3.120 Min. : 981.3 Min. :-11.00
1st Qu.: 5.560 1st Qu.: 4.410 1st Qu.:1009.2 1st Qu.: 5.60
Median : 7.140 Median : 5.000 Median :1014.9 Median : 10.40
Mean : 7.664 Mean : 5.225 Mean :1014.6 Mean : 10.78
3rd Qu.: 9.090 3rd Qu.: 5.820 3rd Qu.:1020.5 3rd Qu.: 16.80
Max. :17.390 Max. :10.430 Max. :1040.2 Max. : 25.10
NA's :29 NA's :29 NA's :3
WTMP DEWP
Min. : 5.40 Min. :-16.000
1st Qu.: 7.10 1st Qu.: 0.900
Median :11.60 Median : 8.600
Mean :12.17 Mean : 7.793
3rd Qu.:16.30 3rd Qu.: 15.300
Max. :24.30 Max. : 23.900
NA's :4
Note the `NA`'s in the summary output. We now have a clean dataframe to work with.
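If you want a quick per-column tally of the remaining missing values rather than the full summary, one option is:

```r
# Count NA values in each column
colSums(is.na(dat1))
```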
Cleaning data tables is an important component of any analytical workflow, but the syntax used thus far can get complicated and difficult to follow. Fortunately, there is an easier way to clean data tables using the `dplyr` package, which will be covered in Week 3.
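As a preview (a sketch only; it assumes the `dplyr` package is installed and uses its `select`, `mutate`, `across` and `na_if` functions), the entire cleaning workflow above collapses to:

```r
library(dplyr)
dat1 <- dat %>%
  select(-MWD, -VIS, -TIDE) %>%                     # drop the empty columns
  mutate(across(c(WSPD, GST, WVHT, DPD, APD), ~ na_if(.x, 99)),
         across(c(WDIR, DEWP), ~ na_if(.x, 999)),   # flag placeholders as NA
         PRES = na_if(PRES, 9999))
```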