Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.
pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider(). Learn more in vignette("pivot").
pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer(). Learn more in vignette("pivot").
Convenience function to paste together multiple columns into one.
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
See tidyr::unnest
A tbl. (See tidyr)
See ?tidyr::nest
Names of new variables to create, as a character vector. Use NA to omit the variable in the output.
A regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.
If TRUE, will run type.convert() with as.is=TRUE on new columns. This is useful if the component columns are integer, numeric or logical.
NB: this will cause string "NA"s to be converted to NAs.
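As a minimal sketch of how into, regex and convert interact, the snippet below runs tidyr::extract() on a made-up data frame (an assumption for illustration; it is not the pasilla data and not a SummarizedExperiment object). Rows that do not match the pattern, and NA inputs, come back as NA.

```r
library(tidyr)

# Hypothetical toy data: two values match the pattern, one does not, one is NA.
df <- data.frame(x = c("a-1", "b-2", "c", NA))

# Two capturing groups in `regex`, so `into` needs two names.
# convert = TRUE runs type.convert(), turning the digit strings into integers.
res <- tidyr::extract(
  df, x,
  into = c("letter", "number"),
  regex = "([a-z])-([0-9])",
  convert = TRUE
)
res
```

The letter column stays character, while number becomes an integer column because every non-missing value parses as an integer.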
<tidy-select> Columns to pivot into longer format.
When pivoting cols into longer format, how should the output rows be arranged relative to their original row number?
"fastest", the default, keeps individual rows from cols close together in the output. This often produces intuitively ordered output when you have at least one key column from data that is not involved in the pivoting process.
"slowest" keeps individual columns from cols close together in the output. This often produces intuitively ordered output when you utilize all of the columns from data in the pivoting process.
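The difference between the two orderings is easiest to see on a tiny example. This is a sketch on a made-up data frame with plain tidyr, and it assumes a tidyr version recent enough to provide the cols_vary argument.

```r
library(tidyr)

# Hypothetical toy data: one key column (id) and two columns to pivot.
df <- data.frame(id = c("a", "b"), x = 1:2, y = 3:4)

# "fastest": the values from one input row stay next to each other.
fast <- pivot_longer(df, c(x, y), cols_vary = "fastest")

# "slowest": the values from one input column stay next to each other.
slow <- pivot_longer(df, c(x, y), cols_vary = "slowest")

fast$name  # "x" "y" "x" "y"
slow$name  # "x" "x" "y" "y"
```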
A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
If length 0, or if NULL is supplied, no columns will be created.
If length 1, a single column will be created which will contain the column names specified by cols.
If length >1, multiple columns will be created. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. There are also two additional character values you can take advantage of:
NA will discard the corresponding component of the column name.
".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.
If names_to contains multiple values, these arguments control how the column name is broken up.
names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).
If these arguments do not give you enough control, use pivot_longer_spec() to create a spec object and process manually as needed.
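To make the multi-component case concrete, here is a small sketch that combines a length-2 names_to, the ".value" sentinel and names_sep. The data frame and its column names are invented for illustration, and the call uses plain tidyr rather than a SummarizedExperiment.

```r
library(tidyr)

# Hypothetical toy data: column names encode a variable (x or y) and a number.
df <- data.frame(
  id  = 1:2,
  x_1 = c(1, 2),
  x_2 = c(3, 4),
  y_1 = c("a", "b"),
  y_2 = c("c", "d")
)

# Split each name on "_": the first piece names the output value column
# (".value"), the second piece goes into a new column called "num".
long <- pivot_longer(
  df, -id,
  names_to = c(".value", "num"),
  names_sep = "_"
)
long
```

Because ".value" is used, values_to is ignored and the result has one x column and one y column, each keeping its own type.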
What happens if the output has invalid column names? The default, "check_unique", is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.
A string specifying the name of the column to create from the data stored in cell values. If names_to is a character vector containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.
If TRUE, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.
Optionally, a list of column name-function pairs. Alternatively, a single function can be supplied, which will be applied to all columns. Use these arguments if you need to change the types of specific columns. For example, names_transform = list(week = as.integer) would convert a character variable called week to an integer.
If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.
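The week example from the text can be sketched as follows. The data frame is made up, and names_prefix (a pivot_longer() argument that strips a leading pattern from the column names) is used here only so that the remaining names are digits that as.integer can parse.

```r
library(tidyr)

# Hypothetical toy data: one measurement column per week.
df <- data.frame(id = "a", wk1 = 5, wk2 = 7)

long <- pivot_longer(
  df, c(wk1, wk2),
  names_to = "week",
  names_prefix = "wk",                          # "wk1" -> "1", "wk2" -> "2"
  names_transform = list(week = as.integer)     # "1" -> 1L, "2" -> 2L
)
long
```

Without names_transform, the week column would be character.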
Optionally, a list of column name-prototype pairs. Alternatively, a single empty prototype can be supplied, which will be applied to all columns. A prototype (or ptype for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. Use these arguments if you want to confirm that the created columns are the types that you expect. Note that if you want to change (instead of confirm) the types of specific columns, you should use names_transform or values_transform instead.
<tidy-select> A set of columns that uniquely identify each observation. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables.
Defaults to all columns in data except for the columns specified through names_from and values_from. If a tidyselect expression is supplied, it will be evaluated on data after removing the columns specified through names_from and values_from.
Should the values in the id_cols columns be expanded by expand() before pivoting? This results in more rows: the output will contain a complete expansion of all possible values in id_cols. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the row values corresponding to the expanded id_cols will be sorted.
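A small sketch of the expansion, using a made-up data frame whose id column is a factor with one unused level. It relies on plain tidyr and assumes a version that provides the id_expand argument.

```r
library(tidyr)

# Hypothetical toy data: level "b" is declared but never observed.
df <- data.frame(
  id  = factor(c("a", "c"), levels = c("a", "b", "c")),
  key = c("x", "y"),
  val = 1:2
)

# id_expand = TRUE adds a row for the unobserved level "b", filled with NA.
wide <- pivot_wider(df, names_from = key, values_from = val, id_expand = TRUE)
wide
```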
<tidy-select> A pair of arguments describing which column (or columns) to get the name of the output column (names_from), and which column (or columns) to get the cell values from (values_from).
If values_from contains multiple columns, the name of each values_from column will be added to the front of the output column name.
If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.
String added to the start of every variable name. This is particularly useful if names_from is a numeric vector and you want to create syntactic variable names.
Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.
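For instance, a glue specification can put the names_from value first and the value column name second. The sketch below uses a made-up data frame with plain tidyr; grp and val are illustrative column names only.

```r
library(tidyr)

# Hypothetical toy data: one id, two groups, one value column.
df <- data.frame(id = 1, grp = c("a", "b"), val = c(10, 20))

# {grp} is replaced by each names_from value, {.value} by the values_from name.
wide <- pivot_wider(
  df,
  names_from = grp,
  values_from = val,
  names_glue = "{grp}_{.value}"
)
names(wide)  # "id" "a_val" "b_val"
```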
Should the column names be sorted? If FALSE, the default, column names are ordered by first appearance.
When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?
"fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.
"slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.
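The two naming schemes can be reproduced on a tiny example. This is a sketch on a made-up data frame with plain tidyr, assuming a version that provides names_vary; v1, v2, n1 and n2 are placeholder names.

```r
library(tidyr)

# Hypothetical toy data: two values_from columns and two names_from values.
df <- data.frame(id = 1, name = c("n1", "n2"), v1 = 1:2, v2 = 3:4)

fast <- pivot_wider(df, names_from = name, values_from = c(v1, v2))
slow <- pivot_wider(df, names_from = name, values_from = c(v1, v2),
                    names_vary = "slowest")

names(fast)  # "id" "v1_n1" "v1_n2" "v2_n1" "v2_n2"
names(slow)  # "id" "v1_n1" "v2_n1" "v1_n2" "v2_n2"
```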
Should the values in the names_from columns be expanded by expand() before pivoting? This results in more columns: the output will contain column names corresponding to a complete expansion of all possible values in names_from. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the column names will be sorted, identical to what names_sort would produce.
Optionally, a (scalar) value that specifies what each value should be filled in with when missing.
This can be a named list if you want to apply different fill values to different value columns.
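A short sketch of filling the gaps that pivoting creates, on a made-up data frame with plain tidyr. The fill value 0L is chosen to match the integer type of the value column.

```r
library(tidyr)

# Hypothetical toy data: id 2 has no "y" observation.
df <- data.frame(id = c(1, 1, 2), key = c("x", "y", "x"), val = c(5L, 6L, 7L))

# Without values_fill the missing cell would be NA; here it becomes 0.
wide <- pivot_wider(df, names_from = key, values_from = val, values_fill = 0L)
wide$y  # 6 0
```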
Optionally, a function applied to the value in each cell in the output. You will typically use this when the combination of id_cols and names_from columns does not uniquely identify an observation.
This can be a named list if you want to apply different aggregations to different values_from columns.
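When several rows land in the same output cell, a summary function collapses them. The sketch below uses a made-up data frame with plain tidyr and sum as the aggregation; any function returning a single value per cell would work the same way.

```r
library(tidyr)

# Hypothetical toy data: id 1 has two observations for key "x".
df <- data.frame(id = c(1, 1, 2), key = c("x", "x", "x"), val = c(1, 2, 3))

# sum() is applied to the values that share an (id, key) combination.
wide <- pivot_wider(df, names_from = key, values_from = val, values_fn = sum)
wide$x  # 3 3
```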
Optionally, a function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).
The default drops all unused columns from the result.
This can be a named list if you want to apply different aggregations to different unused columns.
id_cols must be supplied for unused_fn to be useful, since otherwise all unspecified columns will be considered id_cols.
This is similar to grouping by the id_cols then summarizing the unused columns using unused_fn.
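As a sketch of keeping a summary of an otherwise dropped column, the example below supplies id_cols explicitly so that the made-up stamp column counts as unused. It uses plain tidyr and assumes a version that provides unused_fn.

```r
library(tidyr)

# Hypothetical toy data: `stamp` is neither an id, a name, nor a value column.
df <- data.frame(id = c(1, 1), key = c("x", "y"),
                 val = c(1, 2), stamp = c(10, 20))

# Each id keeps the maximum of its `stamp` values alongside the pivoted columns.
wide <- pivot_wider(
  df,
  id_cols = id,
  names_from = key,
  values_from = val,
  unused_fn = max
)
wide
```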
A data frame.
The name of the new column, as a string or symbol.
This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).
<tidy-select> Columns to unite.
If TRUE, missing values will be removed prior to uniting each value.
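The effect of removing missing values before uniting is shown in this small sketch, which uses a made-up data frame of character columns with plain tidyr.

```r
library(tidyr)

# Hypothetical toy data: the second row has a missing value in column a.
df <- data.frame(a = c("x", NA), b = c("y", "z"), stringsAsFactors = FALSE)

# Default: NA is pasted in as the string "NA".
kept <- unite(df, "ab", a, b)

# na.rm = TRUE: the missing piece (and its separator) is left out.
dropped <- unite(df, "ab", a, b, na.rm = TRUE)

kept$ab     # "x_y" "NA_z"
dropped$ab  # "x_y" "z"
```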
If TRUE, remove input columns from output data frame.
Separator between columns.
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative values start at -1 at the far-right of the string. The length of sep should be one less than into.
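Both forms of sep can be sketched on a made-up data frame with plain tidyr: a string is treated as a regular expression to split on, and a number as a character position.

```r
library(tidyr)

# Hypothetical toy data.
df <- data.frame(x = c("a-1", "b-2"), stringsAsFactors = FALSE)

# Character sep: split wherever the regular expression "-" matches.
by_regex <- separate(df, x, into = c("letter", "digit"), sep = "-")

# Numeric sep: split after the first character; one position for two columns.
by_pos <- separate(df, x, into = c("letter", "rest"), sep = 1)

by_regex$digit  # "1" "2"
by_pos$rest     # "-1" "-2"
```

The new columns are character unless convert = TRUE is also supplied.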
If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
"warn" (the default): emit a warning and drop extra values.
"drop": drop any extra values without a warning.
"merge": only splits at most length(into) times.
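The difference between dropping and merging surplus pieces can be sketched as follows, again on a made-up data frame with plain tidyr. Every value here has three pieces but only two output columns are requested.

```r
library(tidyr)

# Hypothetical toy data: three "_"-separated pieces per value.
df <- data.frame(x = c("a_b_c", "d_e_f"), stringsAsFactors = FALSE)

# "drop": the third piece is discarded silently.
dropped <- separate(df, x, into = c("first", "second"), sep = "_",
                    extra = "drop")

# "merge": splitting stops early, so the last column keeps the remainder.
merged <- separate(df, x, into = c("first", "second"), sep = "_",
                   extra = "merge")

dropped$second  # "b" "e"
merged$second   # "b_c" "e_f"
```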
If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:
"warn" (the default): emit a warning and fill from the right.
"right": fill with missing values on the right.
"left": fill with missing values on the left.
A tidySummarizedExperiment object or a tibble, depending on input.
pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn't going away but is no longer under active development.
pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_wider() for new code; spread() isn't going away but is no longer under active development.
separate() to split up by a separator.
pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivoting specification.
separate(), the complement.
unite(), the complement; extract(), which uses regular expression capturing groups.
tidySummarizedExperiment::pasilla %>%
nest(data=-condition) %>%
unnest(data)
#> # A SummarizedExperiment-tibble abstraction: 102,193 × 5
#> # Features=14599 | Samples=7 | Assays=counts
#> .feature .sample counts type condition
#> <chr> <chr> <int> <chr> <chr>
#> 1 FBgn0000003 untrt1 0 single_end untreated
#> 2 FBgn0000008 untrt1 92 single_end untreated
#> 3 FBgn0000014 untrt1 5 single_end untreated
#> 4 FBgn0000015 untrt1 0 single_end untreated
#> 5 FBgn0000017 untrt1 4664 single_end untreated
#> 6 FBgn0000018 untrt1 583 single_end untreated
#> 7 FBgn0000022 untrt1 0 single_end untreated
#> 8 FBgn0000024 untrt1 10 single_end untreated
#> 9 FBgn0000028 untrt1 0 single_end untreated
#> 10 FBgn0000032 untrt1 1446 single_end untreated
#> # ℹ 40 more rows
tidySummarizedExperiment::pasilla %>%
nest(data=-condition)
#> # A tibble: 2 × 2
#> condition data
#> <chr> <list>
#> 1 untreated <SmmrzdEx[,4]>
#> 2 treated <SmmrzdEx[,3]>
tidySummarizedExperiment::pasilla %>%
extract(type, into="sequencing", regex="([a-z]*)_end", convert=TRUE)
#> Error in as.vector(x, mode): cannot coerce type 'closure' to vector of type 'any'
# See vignette("pivot") for examples and explanation
library(dplyr)
tidySummarizedExperiment::pasilla %>%
pivot_longer(c(condition, type), names_to="name", values_to="value")
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.
#> # A tibble: 204,386 × 5
#> .feature .sample counts name value
#> <chr> <chr> <int> <chr> <chr>
#> 1 FBgn0000003 untrt1 0 condition untreated
#> 2 FBgn0000003 untrt1 0 type single_end
#> 3 FBgn0000008 untrt1 92 condition untreated
#> 4 FBgn0000008 untrt1 92 type single_end
#> 5 FBgn0000014 untrt1 5 condition untreated
#> 6 FBgn0000014 untrt1 5 type single_end
#> 7 FBgn0000015 untrt1 0 condition untreated
#> 8 FBgn0000015 untrt1 0 type single_end
#> 9 FBgn0000017 untrt1 4664 condition untreated
#> 10 FBgn0000017 untrt1 4664 type single_end
#> # ℹ 204,376 more rows
# See vignette("pivot") for examples and explanation
library(dplyr)
tidySummarizedExperiment::pasilla %>%
pivot_wider(names_from=feature, values_from=counts)
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.
#> Warning: tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
#> # A tibble: 7 × 14,602
#> sample condition type FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015
#> <chr> <chr> <chr> <int> <int> <int> <int>
#> 1 untrt1 untreated single_end 0 92 5 0
#> 2 untrt2 untreated single_end 0 161 1 2
#> 3 untrt3 untreated paired_end 0 76 0 1
#> 4 untrt4 untreated paired_end 0 70 0 2
#> 5 trt1 treated single_end 0 140 4 1
#> 6 trt2 treated paired_end 0 88 0 0
#> 7 trt3 treated paired_end 1 70 0 0
#> # ℹ 14,595 more variables: FBgn0000017 <int>, FBgn0000018 <int>,
#> # FBgn0000022 <int>, FBgn0000024 <int>, FBgn0000028 <int>, FBgn0000032 <int>,
#> # FBgn0000036 <int>, FBgn0000037 <int>, FBgn0000038 <int>, FBgn0000039 <int>,
#> # FBgn0000042 <int>, FBgn0000043 <int>, FBgn0000044 <int>, FBgn0000045 <int>,
#> # FBgn0000046 <int>, FBgn0000047 <int>, FBgn0000052 <int>, FBgn0000053 <int>,
#> # FBgn0000054 <int>, FBgn0000055 <int>, FBgn0000056 <int>, FBgn0000057 <int>,
#> # FBgn0000061 <int>, FBgn0000063 <int>, FBgn0000064 <int>, …
tidySummarizedExperiment::pasilla %>%
unite("group", c(condition, type))
#> tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A SummarizedExperiment-tibble abstraction: 102,193 × 4
#> # Features=14599 | Samples=7 | Assays=counts
#> .feature .sample counts group
#> <chr> <chr> <int> <chr>
#> 1 FBgn0000003 untrt1 0 untreated_single_end
#> 2 FBgn0000008 untrt1 92 untreated_single_end
#> 3 FBgn0000014 untrt1 5 untreated_single_end
#> 4 FBgn0000015 untrt1 0 untreated_single_end
#> 5 FBgn0000017 untrt1 4664 untreated_single_end
#> 6 FBgn0000018 untrt1 583 untreated_single_end
#> 7 FBgn0000022 untrt1 0 untreated_single_end
#> 8 FBgn0000024 untrt1 10 untreated_single_end
#> 9 FBgn0000028 untrt1 0 untreated_single_end
#> 10 FBgn0000032 untrt1 1446 untreated_single_end
#> # ℹ 40 more rows