version | |
---|---|
R | 4.0.3 |
subset function
You can subset a dataframe object by criteria using the subset function.
subset(mtcars, mpg > 25)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
You can combine individual criteria using boolean operators when selecting by row.
subset(mtcars, (mpg > 25) & (hp > 65) )
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
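The & operator is only one of the boolean operators that can be used here. As a minimal sketch using the same built-in mtcars dataset, the following combines two criteria with | (OR) and negates a criterion with ! (NOT). The object names high and not4 are only illustrative.

```r
# Rows where mpg exceeds 30 OR hp exceeds 300
high <- subset(mtcars, (mpg > 30) | (hp > 300))
rownames(high)

# Rows that are NOT 4-cylinder cars
not4 <- subset(mtcars, !(cyl == 4))
nrow(not4)
```

Wrapping each criterion in parentheses, as in the examples above, keeps the order of evaluation easy to read.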
You can also subset by column. To select columns, add the select = parameter.
subset(mtcars, (mpg > 25) & (hp > 65) , select = c(mpg, cyl, hp))
mpg cyl hp
Fiat 128 32.4 4 66
Fiat X1-9 27.3 4 66
Porsche 914-2 26.0 4 91
Lotus Europa 30.4 4 113
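The select = parameter also accepts a few shorthand forms. The sketch below assumes the same mtcars dataset: it drops columns by prefixing them with a minus sign and picks a run of adjacent columns with the colon operator. The object names dropped and ranged are only illustrative.

```r
# Keep everything except the listed columns
dropped <- subset(mtcars, mpg > 30,
                  select = -c(disp, drat, wt, qsec, vs, am, gear, carb))
names(dropped)

# Keep all columns from mpg through hp (they are adjacent in mtcars)
ranged <- subset(mtcars, mpg > 30, select = mpg:hp)
names(ranged)
```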
You’ve already learned how to identify elements in an atomic vector or a data frame in an earlier tutorial. For example, to extract the second and fourth elements of the following vector,
x <- c("a", "f", "a", "d", "a")
type,
x[c(2,4)]
[1] "f" "d"
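A few other index forms work the same way. The following is a small sketch on the same vector showing a negative index (which excludes an element), a range of positions, and a reordered index. The object names no_first, middle and swapped are only illustrative.

```r
x <- c("a", "f", "a", "d", "a")

no_first <- x[-1]        # every element except the first
middle   <- x[2:4]       # elements 2 through 4
swapped  <- x[c(4, 2)]   # elements come back in the order requested

no_first
middle
swapped
```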
To subset a dataframe by index, specify an index for each of its two dimensions (rows first, then columns). For example, to extract the first 20 rows and the first and fourth columns of the built-in mtcars dataset type,
mtcars[1:20, c(1,4)]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105
Duster 360 14.3 245
Merc 240D 24.4 62
Merc 230 22.8 95
Merc 280 19.2 123
Merc 280C 17.8 123
Merc 450SE 16.4 180
Merc 450SL 17.3 180
Merc 450SLC 15.2 180
Cadillac Fleetwood 10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial 14.7 230
Fiat 128 32.4 66
Honda Civic 30.4 52
Toyota Corolla 33.9 65
Note that, unlike inside the subset function, a column referenced in a bracket expression needs the mtcars$ prefix (e.g., mtcars$mpg) since the variable mpg does not exist as a standalone object.
You can also reference columns by name, as in,
mtcars[1:20, c("mpg", "hp")]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105
Duster 360 14.3 245
Merc 240D 24.4 62
Merc 230 22.8 95
Merc 280 19.2 123
Merc 280C 17.8 123
Merc 450SE 16.4 180
Merc 450SL 17.3 180
Merc 450SLC 15.2 180
Cadillac Fleetwood 10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial 14.7 230
Fiat 128 32.4 66
Honda Civic 30.4 52
Toyota Corolla 33.9 65
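One detail worth knowing when indexing a dataframe: selecting a single column returns a plain vector rather than a dataframe, unless drop = FALSE is added. A minimal sketch, again assuming mtcars (the object names are only illustrative):

```r
# A single column comes back as a numeric vector
one_col <- mtcars[1:5, "mpg"]
class(one_col)

# drop = FALSE keeps the result as a one-column dataframe
kept <- mtcars[1:5, "mpg", drop = FALSE]
class(kept)

# Leaving the column index empty returns all columns
all_cols <- mtcars[1:5, ]
ncol(all_cols)
```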
We can apply conditions to indices to identify which elements of a vector or a table satisfy one or more criteria.
x[ x == "a" ]
[1] "a" "a" "a"
Let’s break down the above expression. The output of the expression x == "a" is TRUE FALSE TRUE FALSE TRUE, a logical vector with the same number of elements as x. The logical elements are then passed to the indexing brackets where they act as a “mask” as shown in the following graphic.
The elements that make it through the extraction mask are then combined into a new vector.
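The two steps can also be run separately to inspect the mask itself. This is a small sketch; the name mask is only illustrative, and which() is a base R function that returns the positions of the TRUE values.

```r
x <- c("a", "f", "a", "d", "a")

mask <- x == "a"   # step 1: build the logical mask
mask

which(mask)        # positions that pass through the mask
sum(mask)          # number of elements that satisfy the criterion

x[mask]            # step 2: apply the mask; same result as x[ x == "a" ]
```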
The same operation can be applied to dataframes. For example, to extract all rows where mpg > 30 type:
mtcars[ mtcars$mpg > 30, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Here, we are “masking” the rows that do not satisfy the criterion using the TRUE/FALSE logical outcomes from the conditional operation.
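Criteria can be combined inside the brackets too, and the column index can be filled in at the same time. The sketch below reproduces with bracket indexing the selection made earlier with the subset function and then compares the two results. The object names by_index and by_subset are only illustrative.

```r
# Rows where both criteria hold, restricted to three named columns
by_index <- mtcars[mtcars$mpg > 25 & mtcars$hp > 65, c("mpg", "cyl", "hp")]
by_index

# The equivalent call using subset
by_subset <- subset(mtcars, (mpg > 25) & (hp > 65), select = c(mpg, cyl, hp))

# Check that the two approaches agree
all.equal(by_index, by_subset)
```

The bracket form requires the mtcars$ prefix on each column, which is why subset is often more compact for this kind of selection.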
We can adopt the same masking properties of logical variables to replace values in a vector. For example, to replace all instances of a in vector x with z, we first expose the elements equal to a, then assign a new value to the exposed elements.
x[ x == "a" ] <- "z"
x
[1] "z" "f" "z" "d" "z"
You can think of the logical mask as a template applied to a street surface before spraying that template with a can of z spray. Only the exposed portion of the street surface will be sprayed with the z values.
You can apply this technique to dataframes as well. For example, to replace all elements in mpg with -1 if mpg < 25, type:
mtcars2 <- mtcars
mtcars2[ mtcars2$mpg < 25, "mpg"] <- -1
mtcars2
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 -1.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag -1.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 -1.0 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive -1.0 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout -1.0 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant -1.0 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 -1.0 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D -1.0 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 -1.0 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 -1.0 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C -1.0 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE -1.0 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL -1.0 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC -1.0 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood -1.0 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental -1.0 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial -1.0 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona -1.0 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger -1.0 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin -1.0 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 -1.0 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird -1.0 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L -1.0 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino -1.0 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora -1.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E -1.0 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Note that we had to specify the column, "mpg", into which we are replacing the values. Had we left the second index empty, we would have replaced the values in every column of the rows that satisfy the criterion.
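The difference is easier to see on a small table. The following sketch uses a hypothetical three-row dataframe, df, built only for illustration (its values are borrowed from mtcars), and applies the replacement with and without the column index.

```r
# Hypothetical toy dataframe for illustration
df <- data.frame(mpg = c(21.0, 32.4, 18.7),
                 hp  = c(110, 66, 175))

# Column specified: only mpg is replaced in the matching rows
a <- df
a[a$mpg < 25, "mpg"] <- -1
a

# Column index left empty: every column is replaced in the matching rows
b <- df
b[b$mpg < 25, ] <- -1
b
```

In the second case the hp values of the first and third rows are overwritten as well, which is rarely what is intended.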