version
R 4.0.3

The subset function

You can subset a dataframe object by criteria using the subset function.

subset(mtcars, mpg > 25)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

You can combine individual criteria using boolean operators when selecting by row.

subset(mtcars, (mpg > 25) & (hp > 65)  )
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128      32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Fiat X1-9     27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

You can also subset by column. To select columns, add the select argument.

subset(mtcars, (mpg > 25) & (hp > 65) , select = c(mpg, cyl, hp))
               mpg cyl  hp
Fiat 128      32.4   4  66
Fiat X1-9     27.3   4  66
Porsche 914-2 26.0   4  91
Lotus Europa  30.4   4 113
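
The examples above use the & (and) operator and a list of column names. As a small sketch of two related options documented for subset, the | (or) operator keeps a row when at least one condition holds, and select also accepts a range of adjacent columns or a negated list of columns to drop. The object names below are made up for illustration.

```r
# mtcars ships with base R, so no extra packages are needed.

# | keeps a row when at least one of the conditions is TRUE.
frugal_or_powerful <- subset(mtcars, (mpg > 30) | (hp > 300))

# select accepts a range of adjacent columns (mpg through hp) ...
first_four <- subset(mtcars, mpg > 30, select = mpg:hp)

# ... or a negated vector, which drops the listed columns.
no_vs_am <- subset(mtcars, mpg > 30, select = -c(vs, am))

nrow(frugal_or_powerful)  # 5 (four cars above 30 mpg plus the Maserati Bora)
names(first_four)         # "mpg" "cyl" "disp" "hp"
ncol(no_vs_am)            # 9
```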

Subset using indices

You’ve already learned how to identify elements in an atomic vector or a data frame in an earlier tutorial. For example, to extract the second and fourth elements of the following vector,

x <- c("a", "f", "a", "d", "a")

type,

x[c(2,4)]
[1] "f" "d"
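
As a brief aside, the index vector controls more than which elements are returned. The sketch below, which redefines the same x so that it runs on its own, shows that the order of the indices sets the order of the output and that negative indices exclude elements instead of selecting them.

```r
x <- c("a", "f", "a", "d", "a")

# The output follows the order of the indices, not the order in x.
reordered <- x[c(4, 2)]
reordered   # "d" "f"

# Negative indices drop the listed positions and keep everything else.
remaining <- x[-c(2, 4)]
remaining   # "a" "a" "a"
```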

To subset a dataframe by index, specify an index for each dimension: rows first, then columns. For example, to extract the first 20 rows and the first and fourth columns of the built-in mtcars dataset, type,

mtcars[ 1:20, c(1,4)]
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
Cadillac Fleetwood  10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial   14.7 230
Fiat 128            32.4  66
Honda Civic         30.4  52
Toyota Corolla      33.9  65

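Note that, unlike the subset function, the indexing brackets do not look for column names inside the data frame. The variable mpg does not exist as a standalone object, so any expression that uses a column's values inside the brackets must reference it as mtcars$mpg.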

You can also reference columns by names as in,

mtcars[ 1:20, c("mpg", "hp")]
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
Cadillac Fleetwood  10.4 205
Lincoln Continental 10.4 215
Chrysler Imperial   14.7 230
Fiat 128            32.4  66
Honda Civic         30.4  52
Toyota Corolla      33.9  65
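
One detail worth checking when you index by position or by name is what happens when only a single column is requested. The following sketch (object names are illustrative) shows that base R simplifies a one-column result to a plain vector unless you add drop = FALSE.

```r
# Positions and names pick out the same columns.
by_position <- mtcars[1:20, c(1, 4)]
by_name     <- mtcars[1:20, c("mpg", "hp")]
isTRUE(all.equal(by_position, by_name))  # TRUE

# A single column is simplified to a vector by default ...
mpg_vector <- mtcars[1:3, "mpg"]
class(mpg_vector)   # "numeric"

# ... unless drop = FALSE is used to keep the data frame structure.
mpg_frame <- mtcars[1:3, "mpg", drop = FALSE]
class(mpg_frame)    # "data.frame"
dim(mpg_frame)      # 3 1
```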

Extracting using logical expressions and indices

We can place a condition inside the indexing brackets to extract the elements of a vector or a table that satisfy one or more criteria.

x[ x == "a" ]
[1] "a" "a" "a"

Let’s break down the above expression. The expression x == "a" returns TRUE FALSE TRUE FALSE TRUE, a logical vector with the same number of elements as x. This logical vector is then passed to the indexing brackets, where it acts as a “mask”, as shown in the following graphic.

The elements that make it through the extraction mask are then combined into a new vector.
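
To see the mask for yourself, you can store the logical vector in its own object before using it as an index. This is only a sketch; the name mask is an arbitrary choice, and x is redefined so the snippet runs on its own.

```r
x <- c("a", "f", "a", "d", "a")

# The comparison is evaluated first and returns one logical value per element.
mask <- x == "a"
mask          # TRUE FALSE  TRUE FALSE  TRUE

# which() reports the positions that pass through the mask.
which(mask)   # 1 3 5

# TRUE counts as 1, so sum() gives the number of matches.
sum(mask)     # 3

# Indexing with the stored mask is the same as x[ x == "a" ].
x[mask]       # "a" "a" "a"
```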

The same operation can be applied to dataframes. For example, to extract all rows where mpg > 30, type:

mtcars[ mtcars$mpg > 30, ]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Here, we are “masking” the rows that do not satisfy the criterion using the TRUE/FALSE logical outcomes from the conditional operation.
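
Conditions can be combined inside the brackets in the same way as with the subset function, as long as each column is referenced through mtcars$. The sketch below compares the two approaches and then illustrates one difference in how they treat missing values. The small data frame scores is hypothetical and exists only to introduce an NA, since mtcars has none.

```r
# Build the mask from two conditions, then pick rows and columns by index.
keep <- mtcars$mpg > 25 & mtcars$hp > 65
by_index  <- mtcars[keep, c("mpg", "cyl", "hp")]
by_subset <- subset(mtcars, (mpg > 25) & (hp > 65), select = c(mpg, cyl, hp))
isTRUE(all.equal(by_index, by_subset))   # TRUE

# Hypothetical data with a missing value.
scores <- data.frame(id = 1:3, score = c(10, NA, 30))

# With brackets, an NA in the mask yields a row filled with NA.
nrow(scores[scores$score > 5, ])   # 3

# subset() treats an NA condition as FALSE and drops the row.
nrow(subset(scores, score > 5))    # 2
```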

Replacing values using logical expressions

We can use the same masking properties of logical vectors to replace values in a vector. For example, to replace all instances of "a" in vector x with "z", we first expose the elements equal to "a", then assign a new value to the exposed elements.

x[ x == "a" ] <- "z"
x
[1] "z" "f" "z" "d" "z"

You can think of the logical mask as a template applied to a street surface before spraying that template with a can of z spray. Only the exposed portion of the street surface will be sprayed with the z values.
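
Any expression that returns a logical vector of the right length can serve as the template. As a sketch, the first example below uses %in% to expose elements that match any of several values, and the second recodes a placeholder value as NA. The temps vector and its -999 missing-value code are invented for illustration.

```r
x <- c("a", "f", "a", "d", "a")

# %in% is TRUE wherever an element matches any value in the set.
x[x %in% c("a", "f")] <- "z"
x       # "z" "z" "z" "d" "z"

# Hypothetical readings where -999 stands for a missing measurement.
temps <- c(12, -999, 15, -999)
temps[temps == -999] <- NA
temps   # 12 NA 15 NA
```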

You can apply this technique to dataframes as well. For example, to replace every value of mpg that is less than 25 with -1, type:

mtcars2 <- mtcars 
mtcars2[ mtcars2$mpg < 25, "mpg"]  <-  -1
mtcars2
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           -1.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       -1.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          -1.0   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      -1.0   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   -1.0   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             -1.0   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          -1.0   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           -1.0   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            -1.0   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            -1.0   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           -1.0   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          -1.0   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          -1.0   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         -1.0   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  -1.0   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental -1.0   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   -1.0   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       -1.0   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    -1.0   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         -1.0   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          -1.0   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    -1.0   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      -1.0   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        -1.0   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       -1.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          -1.0   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Note that we had to specify the column, "mpg", in which the values are replaced. Had we left the second index empty, we would have replaced the values in every column of the selected rows.
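
To see the effect of an empty second index, you can try it on another copy of the data. This is a sketch only; mtcars3 is an arbitrary name chosen so that mtcars and mtcars2 are left untouched.

```r
mtcars3 <- mtcars

# No column is named, so every column of the matching rows is overwritten.
mtcars3[mtcars3$mpg < 25, ] <- -1

# Mazda RX4 (mpg 21.0) is overwritten across the board;
# Fiat 128 (mpg 32.4) does not match the condition and is unchanged.
mtcars3[c("Mazda RX4", "Fiat 128"), 1:4]

# Number of rows that were overwritten.
sum(mtcars3$mpg == -1)   # 26
```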