Select (and optionally rename) variables in a data frame, using a concise
mini-language that makes it easy to refer to variables based on their name
(e.g. a:f
selects all columns from a
on the left to f
on the
right) or type (e.g. where(is.numeric)
selects all numeric columns).
Tidyverse selections implement a dialect of R where operators make it easy to select variables:
:
for selecting a range of consecutive variables.
!
for taking the complement of a set of variables.
&
and |
for selecting the intersection or the union of two
sets of variables.
c()
for combining selections.
In addition, you can use selection helpers. Some helpers select specific columns:
everything()
: Matches all variables.
last_col()
: Select last variable, possibly with an offset.
group_cols()
: Select all grouping columns.
Other helpers select variables by matching patterns in their names:
starts_with()
: Starts with a prefix.
ends_with()
: Ends with a suffix.
contains()
: Contains a literal string.
matches()
: Matches a regular expression.
num_range()
: Matches a numerical range like x01, x02, x03.
Or from variables stored in a character vector:
all_of()
: Matches variable names in a character vector. All
names must be present, otherwise an out-of-bounds error is
thrown.
any_of()
: Same as all_of()
, except that no error is thrown
for names that don't exist.
Or using a predicate function:
where()
: Applies a function to all variables and selects those
for which the function returns TRUE
.
# S3 method for class 'SingleCellExperiment'
select(.data, ...)
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
<tidy-select
> One or more unquoted
expressions separated by commas. Variable names can be used as if they
were positions in the data frame, so expressions like x:y
can
be used to select a range of variables.
An object of the same type as .data
. The output has the following
properties:
Rows are not affected.
Output columns are a subset of input columns, potentially with a different
order. Columns will be renamed if new_name = old_name
form is used.
Data frame attributes are preserved.
Groups are maintained; you can't select off grouping variables.
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The following methods are currently available in loaded packages:
(ChipDb
, GODb
, OrgDb
, OrthologyDb
, ReactomeDb
), dbplyr (tbl_lazy
), dplyr (data.frame
), plotly (plotly
), tidySingleCellExperiment (SingleCellExperiment
)
.
Here we show the usage for the basic selection operators. See the
specific help pages to learn about helpers like starts_with()
.
The selection language can be used in functions like
dplyr::select()
or tidyr::pivot_longer()
. Let's first attach
the tidyverse:
Select variables by name:
starwars %>% select(height)
#> # A tibble: 87 x 1
#> height
#> <int>
#> 1 172
#> 2 167
#> 3 96
#> 4 202
#> # i 83 more rows
iris %>% pivot_longer(Sepal.Length)
#> # A tibble: 150 x 6
#> Sepal.Width Petal.Length Petal.Width Species name value
#> <dbl> <dbl> <dbl> <fct> <chr> <dbl>
#> 1 3.5 1.4 0.2 setosa Sepal.Length 5.1
#> 2 3 1.4 0.2 setosa Sepal.Length 4.9
#> 3 3.2 1.3 0.2 setosa Sepal.Length 4.7
#> 4 3.1 1.5 0.2 setosa Sepal.Length 4.6
#> # i 146 more rows
Select multiple variables by separating them with commas. Note how the order of columns is determined by the order of inputs:
starwars %>% select(homeworld, height, mass)
#> # A tibble: 87 x 3
#> homeworld height mass
#> <chr> <int> <dbl>
#> 1 Tatooine 172 77
#> 2 Tatooine 167 75
#> 3 Naboo 96 32
#> 4 Tatooine 202 136
#> # i 83 more rows
Functions like tidyr::pivot_longer()
don't take variables with
dots. In this case use c()
to select multiple variables:
iris %>% pivot_longer(c(Sepal.Length, Petal.Length))
#> # A tibble: 300 x 5
#> Sepal.Width Petal.Width Species name value
#> <dbl> <dbl> <fct> <chr> <dbl>
#> 1 3.5 0.2 setosa Sepal.Length 5.1
#> 2 3.5 0.2 setosa Petal.Length 1.4
#> 3 3 0.2 setosa Sepal.Length 4.9
#> 4 3 0.2 setosa Petal.Length 1.4
#> # i 296 more rows
The :
operator selects a range of consecutive variables:
starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # i 83 more rows
The !
operator negates a selection:
starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 blond fair blue 19 male masculine Tatooine Human
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid
#> 3 <NA> white, blue red 33 none masculine Naboo Droid
#> 4 none white yellow 41.9 male masculine Tatooine Human
#> # i 83 more rows
#> # i 3 more variables: films <list>, vehicles <list>, starships <list>
iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct>
#> 1 3.5 0.2 setosa
#> 2 3 0.2 setosa
#> 3 3.2 0.2 setosa
#> 4 3.1 0.2 setosa
#> # i 146 more rows
iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> # i 146 more rows
&
and |
take the intersection or the union of two selections:
iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # i 146 more rows
iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # i 146 more rows
To take the difference between two selections, combine the &
and
!
operators:
data(pbmc_small)
pbmc_small |> select(cell, orig.ident)
#> Warning: tidySingleCellExperiment says: from version 1.3.1, the special columns including cell id (colnames(se)) has changed to ".cell". This dataset is returned with the old-style vocabulary (cell), however, we suggest to update your workflow to reflect the new vocabulary (.cell).
#> # A SingleCellExperiment-tibble abstraction: 80 × 9
#> # Features=230 | Cells=80 | Assays=counts, logcounts
#> cell orig.ident PC_1 PC_2 PC_3 PC_4 PC_5 tSNE_1 tSNE_2
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ATGCCAGAACGACT SeuratProj… -0.774 -0.900 -0.249 0.559 0.465 0.868 -8.10
#> 2 CATGGCCTGTGCAT SeuratProj… -0.0260 -0.347 0.665 0.418 0.585 -7.39 -8.77
#> 3 GAACCTGATGAACC SeuratProj… -0.457 0.180 1.32 2.01 -0.482 -28.2 0.241
#> 4 TGACTGGATTCTCA SeuratProj… -0.812 -1.38 -1.00 0.139 -1.60 16.3 -11.2
#> 5 AGTCAGACTGCACA SeuratProj… -0.774 -0.900 -0.249 0.559 0.465 1.91 -11.2
#> 6 TCTGATACACGTGT SeuratProj… -0.774 -0.900 -0.249 0.559 0.465 3.15 -9.94
#> 7 TGGTATCTAAACAG SeuratProj… -0.460 -1.19 -0.312 0.716 -1.65 17.9 -9.90
#> 8 GCAGCTCTGTTTCT SeuratProj… -0.900 -0.388 0.693 0.404 0.536 -6.49 -8.39
#> 9 GATATAACACGCAT SeuratProj… -0.774 -0.900 -0.249 0.559 0.465 1.33 -9.68
#> 10 AATGTTGACAGTCA SeuratProj… -0.488 -1.16 -0.306 0.702 -1.47 17.0 -9.43
#> # ℹ 70 more rows