Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().

Learn more in vignette("pivot").

pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().

Learn more in vignette("pivot").

unite() is a convenience function that pastes multiple columns together into one.

Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

Arguments

keep_empty

See tidyr::unnest

ptype

See tidyr::unnest

.drop

See tidyr::unnest

.id

See tidyr::unnest

.sep

See tidyr::unnest

.preserve

See tidyr::unnest

.data

A tbl. (See tidyr)

.names_sep

See ?tidyr::nest

into

Names of new variables to create as character vector. Use NA to omit the variable in the output.

regex

A regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.

convert

If TRUE, will run type.convert() with as.is=TRUE on new columns. This is useful if the component columns are integer, numeric or logical.

NB: this will cause string "NA"s to be converted to NAs.
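As a rough illustration of how into, regex and convert interact, the sketch below calls tidyr::extract() on a small made-up tibble. The column name and sample labels are assumptions for illustration only and are not taken from the pasilla data; the behaviour shown is that of tidyr on a plain data frame.

```r
library(tidyr)

# Hypothetical sample labels; not part of any packaged dataset.
df <- tibble(sample = c("untrt1", "trt2", "ctrl", NA))

# Two capturing groups, so `into` needs two names.
# convert = TRUE lets type.convert() turn the digit strings into integers.
res <- extract(
  df, sample,
  into = c("group", "replicate"),
  regex = "([a-z]+)([0-9]+)",
  convert = TRUE
)

res$group      # "ctrl" has no digits, so the match fails and the row is NA
res$replicate  # an integer column after conversion
```

Rows where the whole pattern fails to match come back as NA in every new column, which is why "ctrl" and the NA input behave the same way here.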

cols

<tidy-select> Columns to pivot into longer format.

cols_vary

When pivoting cols into longer format, how should the output rows be arranged relative to their original row number?

  • "fastest", the default, keeps individual rows from cols close together in the output. This often produces intuitively ordered output when you have at least one key column from data that is not involved in the pivoting process.

  • "slowest" keeps individual columns from cols close together in the output. This often produces intuitively ordered output when you utilize all of the columns from data in the pivoting process.
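The difference between the two cols_vary settings is easiest to see on a tiny table. This is a minimal sketch on a made-up tibble (the names id, x and y are hypothetical), assuming a tidyr version that provides cols_vary.

```r
library(tidyr)

# Hypothetical two-row table with one key column and two value columns.
df <- tibble(id = c("a", "b"), x = 1:2, y = 3:4)

# "fastest": the values from one original row stay next to each other.
fast <- pivot_longer(df, c(x, y), cols_vary = "fastest")
fast$name   # x, y, x, y

# "slowest": the values from one original column stay next to each other.
slow <- pivot_longer(df, c(x, y), cols_vary = "slowest")
slow$name   # x, x, y, y
```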

names_to

A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.

  • If length 0, or if NULL is supplied, no columns will be created.

  • If length 1, a single column will be created which will contain the column names specified by cols.

  • If length >1, multiple columns will be created. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. There are also two additional character values you can take advantage of:

    • NA will discard the corresponding component of the column name.

    • ".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

names_sep, names_pattern

If names_to contains multiple values, these arguments control how the column name is broken up.

names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).

If these arguments do not give you enough control, use pivot_longer_spec() to create a spec object and process manually as needed.
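The interplay between names_to, names_sep and names_pattern can be sketched as follows. The tibble and its column names (counts_trt, score_untrt, and so on) are invented for illustration; the example uses tidyr on a plain data frame rather than a SummarizedExperiment.

```r
library(tidyr)

# Hypothetical wide table: each column name encodes "<measure>_<condition>".
df <- tibble(
  id           = 1:2,
  counts_trt   = c(10, 20),
  counts_untrt = c(5, 8),
  score_trt    = c(0.1, 0.2),
  score_untrt  = c(0.3, 0.4)
)

# ".value" sends the first name component to the output column names,
# while the second component is stored in a new `condition` column.
long <- pivot_longer(
  df, -id,
  names_to = c(".value", "condition"),
  names_sep = "_"
)

# The same split expressed with capturing groups instead of a separator.
long2 <- pivot_longer(
  df, -id,
  names_to = c(".value", "condition"),
  names_pattern = "(.*)_(.*)"
)

# NA in names_to discards a name component (here the "counts" prefix).
counts_only <- pivot_longer(
  df[, c("id", "counts_trt", "counts_untrt")], -id,
  names_to = c(NA, "condition"),
  names_sep = "_",
  values_to = "counts"
)
```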

names_repair

What happens if the output has invalid column names? The default, "check_unique", is to error if the columns are duplicated. Use "minimal" to allow duplicates in the output, or "unique" to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.

values_to

A string specifying the name of the column to create from the data stored in cell values. If names_to is a character containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.

values_drop_na

If TRUE, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.
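A small sketch of the effect, using a made-up tibble in which the NAs exist only because of the wide layout (the names id, x and y are hypothetical):

```r
library(tidyr)

# Hypothetical table: each id has a value in only one of the two columns.
df <- tibble(id = 1:2, x = c(1, NA), y = c(NA, 4))

# Default: the structural NAs become explicit rows.
all_rows <- pivot_longer(df, c(x, y))
nrow(all_rows)   # 4

# values_drop_na = TRUE removes the rows whose value is NA.
kept <- pivot_longer(df, c(x, y), values_drop_na = TRUE)
kept$name        # x, y
kept$value       # 1, 4
```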

names_transform, values_transform

Optionally, a list of column name-function pairs. Alternatively, a single function can be supplied, which will be applied to all columns. Use these arguments if you need to change the types of specific columns. For example, names_transform = list(week = as.integer) would convert a character variable called week to an integer.

If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.
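To make the week example concrete, here is a minimal sketch on an invented tibble. The column names wk1 and wk2 are assumptions, and names_prefix is the pivot_longer() argument in tidyr that strips a leading pattern from the column names.

```r
library(tidyr)

# Hypothetical weekly measurements stored in columns wk1 and wk2.
df <- tibble(id = "a", wk1 = 5, wk2 = 7)

# Without a transform, the names column is character.
as_chr <- pivot_longer(df, c(wk1, wk2), names_to = "week", names_prefix = "wk")
class(as_chr$week)

# names_transform converts the stripped names to integers.
as_int <- pivot_longer(
  df, c(wk1, wk2),
  names_to = "week",
  names_prefix = "wk",
  names_transform = list(week = as.integer)
)
as_int$week
```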

names_ptypes, values_ptypes

Optionally, a list of column name-prototype pairs. Alternatively, a single empty prototype can be supplied, which will be applied to all columns. A prototype (or ptype for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. Use these arguments if you want to confirm that the created columns are the types that you expect. Note that if you want to change (instead of confirm) the types of specific columns, you should use names_transform or values_transform instead.

id_cols

<tidy-select> A set of columns that uniquely identify each observation. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables.

Defaults to all columns in data except for the columns specified through names_from and values_from. If a tidyselect expression is supplied, it will be evaluated on data after removing the columns specified through names_from and values_from.

id_expand

Should the values in the id_cols columns be expanded by expand() before pivoting? This results in more rows: the output will contain a complete expansion of all possible values in id_cols. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the row values corresponding to the expanded id_cols will be sorted.

names_from, values_from

<tidy-select> A pair of arguments describing which column (or columns) to get the name of the output column (names_from), and which column (or columns) to get the cell values from (values_from).

If values_from contains multiple columns, the name of each values_from column will be added to the front of the corresponding output column names.

names_sep

If names_from or values_from contains multiple variables, this will be used to join their values together into a single string to use as a column name.

names_prefix

String added to the start of every variable name. This is particularly useful if names_from is a numeric vector and you want to create syntactic variable names.

names_glue

Instead of names_sep and names_prefix, you can supply a glue specification that uses the names_from columns (and special .value) to create custom column names.
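The naming arguments can be compared side by side on a toy long table. This is only a sketch: the sample and gene labels are invented, and the glue template "{gene}_counts" is one arbitrary choice of format.

```r
library(tidyr)

# Hypothetical long table of counts per sample and gene.
df <- tibble(
  sample = c("s1", "s1", "s2", "s2"),
  gene   = c("g1", "g2", "g1", "g2"),
  counts = c(1, 2, 3, 4)
)

# names_prefix puts a fixed string in front of every new column name.
prefixed <- pivot_wider(
  df, names_from = gene, values_from = counts,
  names_prefix = "gene_"
)
names(prefixed)

# names_glue builds each name from a template over the names_from columns.
glued <- pivot_wider(
  df, names_from = gene, values_from = counts,
  names_glue = "{gene}_counts"
)
names(glued)
```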

names_sort

Should the column names be sorted? If FALSE, the default, column names are ordered by first appearance.

names_vary

When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?

  • "fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.

  • "slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.
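The two orderings can be checked on a small made-up table with two values_from columns (v1 and v2 are hypothetical names). The sketch assumes a tidyr version that provides names_vary.

```r
library(tidyr)

# Hypothetical table: one id, two names, two value columns.
df <- tibble(
  id   = c(1, 1),
  name = c("a", "b"),
  v1   = c(1, 2),
  v2   = c(3, 4)
)

# "fastest": all names for v1 first, then all names for v2.
fast <- pivot_wider(df, names_from = name, values_from = c(v1, v2),
                    names_vary = "fastest")
names(fast)

# "slowest": both value columns for "a" first, then both for "b".
slow <- pivot_wider(df, names_from = name, values_from = c(v1, v2),
                    names_vary = "slowest")
names(slow)
```

In both cases the values_from column name is placed in front of the names_from value, joined by the default names_sep of "_".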

names_expand

Should the values in the names_from columns be expanded by expand() before pivoting? This results in more columns: the output will contain column names corresponding to a complete expansion of all possible values in names_from. Implicit factor levels that aren't represented in the data will become explicit. Additionally, the column names will be sorted, identical to what names_sort would produce.
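A minimal sketch of names_expand with an unused factor level follows. The data are invented, and the example assumes a tidyr version that supports names_expand.

```r
library(tidyr)

# Hypothetical table where the factor level "b" never occurs in the data.
df <- tibble(
  id    = 1,
  name  = factor("a", levels = c("a", "b")),
  value = 10
)

# Default: only observed levels become columns.
observed <- pivot_wider(df, names_from = name, values_from = value)
names(observed)

# names_expand = TRUE also creates a column for the implicit level "b".
expanded <- pivot_wider(df, names_from = name, values_from = value,
                        names_expand = TRUE)
names(expanded)
```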

values_fill

Optionally, a (scalar) value that specifies what each value should be filled in with when missing.

This can be a named list if you want to apply different fill values to different value columns.
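As an illustration, the sketch below fills the cell for a sample and gene combination that is absent from a made-up long table (all names and values are hypothetical).

```r
library(tidyr)

# Hypothetical long table: sample s2 has no row for gene g2.
df <- tibble(
  sample = c("s1", "s1", "s2"),
  gene   = c("g1", "g2", "g1"),
  counts = c(1L, 2L, 3L)
)

# Default: the missing combination becomes NA.
with_na <- pivot_wider(df, names_from = gene, values_from = counts)
with_na$g2

# values_fill supplies a replacement for cells that have no source row.
filled <- pivot_wider(df, names_from = gene, values_from = counts,
                      values_fill = 0L)
filled$g2
```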

values_fn

Optionally, a function applied to the value in each cell in the output. You will typically use this when the combination of id_cols and names_from columns does not uniquely identify an observation.

This can be a named list if you want to apply different aggregations to different values_from columns.
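The following sketch shows values_fn collapsing duplicated cells with sum(). The table is invented, and sum is just one possible aggregation.

```r
library(tidyr)

# Hypothetical long table: sample s1 has two rows for the same gene.
df <- tibble(
  sample = c("s1", "s1", "s2"),
  gene   = c("g1", "g1", "g1"),
  counts = c(1, 2, 3)
)

# values_fn aggregates all values that land in the same output cell.
wide <- pivot_wider(df, names_from = gene, values_from = counts,
                    values_fn = sum)
wide$g1
```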

unused_fn

Optionally, a function applied to summarize the values from the unused columns (i.e. columns not identified by id_cols, names_from, or values_from).

The default drops all unused columns from the result.

This can be a named list if you want to apply different aggregations to different unused columns.

id_cols must be supplied for unused_fn to be useful, since otherwise all unspecified columns will be considered id_cols.

This is similar to grouping by the id_cols then summarizing the unused columns using unused_fn.
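A small sketch of unused_fn, under the assumption of a tidyr version that supports it. The depth column is a hypothetical per-row covariate that is neither an id, a name, nor a value column.

```r
library(tidyr)

# Hypothetical long table with an extra `depth` column.
df <- tibble(
  sample = c("s1", "s1", "s2", "s2"),
  gene   = c("g1", "g2", "g1", "g2"),
  counts = c(1, 2, 3, 4),
  depth  = c(10, 20, 30, 40)
)

# With id_cols given and no unused_fn, `depth` is dropped.
dropped <- pivot_wider(df, id_cols = sample,
                       names_from = gene, values_from = counts)

# unused_fn keeps `depth`, summarised per sample (here with max()).
kept <- pivot_wider(df, id_cols = sample,
                    names_from = gene, values_from = counts,
                    unused_fn = list(depth = max))
kept$depth
```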

data

A data frame.

col

The name of the new column, as a string or symbol.

This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).

...

<tidy-select> Columns to unite

na.rm

If TRUE, missing values will be removed prior to uniting each value.

remove

If TRUE, remove input columns from output data frame.
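The effect of na.rm and remove in unite() can be sketched on a made-up tibble (the values are hypothetical; tidyr is applied to a plain data frame here):

```r
library(tidyr)

# Hypothetical annotation columns, one with a missing value.
df <- tibble(
  condition = c("treated", NA),
  type      = c("single_end", "paired_end")
)

# Default: NA is pasted as the string "NA", and the inputs are removed.
plain <- unite(df, "group", condition, type)
plain$group

# na.rm = TRUE drops missing pieces; remove = FALSE keeps the input columns.
clean <- unite(df, "group", condition, type, na.rm = TRUE, remove = FALSE)
clean$group
names(clean)
```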

sep

Separator between columns.

If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative values start at -1 at the far-right of the string. The length of sep should be one less than into.
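Both forms of sep for separate() are sketched below on invented strings. The column names and the split position are assumptions chosen for illustration.

```r
library(tidyr)

# Character sep: treated as a regular expression to split on.
df <- tibble(label = c("untreated-single", "treated-paired"))
by_regex <- separate(df, label, into = c("condition", "type"), sep = "-")
by_regex$condition
by_regex$type

# Numeric sep: split after the 4th character.
ids <- tibble(feature = c("FBgn0000003", "FBgn0000008"))
by_pos <- separate(ids, feature, into = c("prefix", "number"), sep = 4)
by_pos$prefix
by_pos$number
```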

extra

If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:

  • "warn" (the default): emit a warning and drop extra values.

  • "drop": drop any extra values without a warning.

  • "merge": split at most length(into) times, keeping any remaining text in the last column.

fill

If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:

  • "warn" (the default): emit a warning and fill from the right.

  • "right": fill with missing values on the right.

  • "left": fill with missing values on the left.
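The extra and fill options of separate() can be compared on strings with too many and too few pieces. This is a sketch on made-up data; p1 and p2 are hypothetical output names.

```r
library(tidyr)

# Hypothetical strings with three, two and one piece.
df <- tibble(x = c("a_b_c", "a_b", "a"))

# extra = "merge" keeps the surplus text in the last column;
# fill = "right" pads missing pieces with NA on the right.
merged <- separate(df, x, into = c("p1", "p2"), sep = "_",
                   extra = "merge", fill = "right")
merged$p1
merged$p2

# extra = "drop" discards the surplus piece silently;
# fill = "left" pads missing pieces with NA on the left.
dropped <- separate(df, x, into = c("p1", "p2"), sep = "_",
                    extra = "drop", fill = "left")
dropped$p1
dropped$p2
```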

Value

A tidySummarizedExperiment object or a tibble, depending on the input.

Details

pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn't going away but is no longer under active development.

pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_wider() for new code; spread() isn't going away but is no longer under active development.

See also

separate() to split up by a separator.

pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivoting specification.

separate(), the complement.

unite(), the complement; extract(), which uses regular expression capturing groups.

Examples


tidySummarizedExperiment::pasilla %>%
    nest(data=-condition) %>%
    unnest(data)
#> # A SummarizedExperiment-tibble abstraction: 102,193 × 5
#> # Features=14599 | Samples=7 | Assays=counts
#>    .feature    .sample counts type       condition
#>    <chr>       <chr>    <int> <chr>      <chr>    
#>  1 FBgn0000003 untrt1       0 single_end untreated
#>  2 FBgn0000008 untrt1      92 single_end untreated
#>  3 FBgn0000014 untrt1       5 single_end untreated
#>  4 FBgn0000015 untrt1       0 single_end untreated
#>  5 FBgn0000017 untrt1    4664 single_end untreated
#>  6 FBgn0000018 untrt1     583 single_end untreated
#>  7 FBgn0000022 untrt1       0 single_end untreated
#>  8 FBgn0000024 untrt1      10 single_end untreated
#>  9 FBgn0000028 untrt1       0 single_end untreated
#> 10 FBgn0000032 untrt1    1446 single_end untreated
#> # ℹ 40 more rows


tidySummarizedExperiment::pasilla %>%
    nest(data=-condition)
#> # A tibble: 2 × 2
#>   condition data          
#>   <chr>     <list>        
#> 1 untreated <SmmrzdEx[,4]>
#> 2 treated   <SmmrzdEx[,3]>


tidySummarizedExperiment::pasilla %>%

    extract(type, into="sequencing", regex="([a-z]*)_end", convert=TRUE)
#> Error in as.vector(x, mode): cannot coerce type 'closure' to vector of type 'any'
    
# See vignette("pivot") for examples and explanation

library(dplyr)
tidySummarizedExperiment::pasilla %>%
    pivot_longer(c(condition, type), names_to="name", values_to="value")
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.
#> # A tibble: 204,386 × 5
#>    .feature    .sample counts name      value     
#>    <chr>       <chr>    <int> <chr>     <chr>     
#>  1 FBgn0000003 untrt1       0 condition untreated 
#>  2 FBgn0000003 untrt1       0 type      single_end
#>  3 FBgn0000008 untrt1      92 condition untreated 
#>  4 FBgn0000008 untrt1      92 type      single_end
#>  5 FBgn0000014 untrt1       5 condition untreated 
#>  6 FBgn0000014 untrt1       5 type      single_end
#>  7 FBgn0000015 untrt1       0 condition untreated 
#>  8 FBgn0000015 untrt1       0 type      single_end
#>  9 FBgn0000017 untrt1    4664 condition untreated 
#> 10 FBgn0000017 untrt1    4664 type      single_end
#> # ℹ 204,376 more rows
    
# See vignette("pivot") for examples and explanation

library(dplyr)
tidySummarizedExperiment::pasilla %>%

    pivot_wider(names_from=feature, values_from=counts)
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.
#> Warning: tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
#> # A tibble: 7 × 14,602
#>   sample condition type       FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015
#>   <chr>  <chr>     <chr>            <int>       <int>       <int>       <int>
#> 1 untrt1 untreated single_end           0          92           5           0
#> 2 untrt2 untreated single_end           0         161           1           2
#> 3 untrt3 untreated paired_end           0          76           0           1
#> 4 untrt4 untreated paired_end           0          70           0           2
#> 5 trt1   treated   single_end           0         140           4           1
#> 6 trt2   treated   paired_end           0          88           0           0
#> 7 trt3   treated   paired_end           1          70           0           0
#> # ℹ 14,595 more variables: FBgn0000017 <int>, FBgn0000018 <int>,
#> #   FBgn0000022 <int>, FBgn0000024 <int>, FBgn0000028 <int>, FBgn0000032 <int>,
#> #   FBgn0000036 <int>, FBgn0000037 <int>, FBgn0000038 <int>, FBgn0000039 <int>,
#> #   FBgn0000042 <int>, FBgn0000043 <int>, FBgn0000044 <int>, FBgn0000045 <int>,
#> #   FBgn0000046 <int>, FBgn0000047 <int>, FBgn0000052 <int>, FBgn0000053 <int>,
#> #   FBgn0000054 <int>, FBgn0000055 <int>, FBgn0000056 <int>, FBgn0000057 <int>,
#> #   FBgn0000061 <int>, FBgn0000063 <int>, FBgn0000064 <int>, …

tidySummarizedExperiment::pasilla %>%
    unite("group", c(condition, type))
#> tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A SummarizedExperiment-tibble abstraction: 102,193 × 4
#> # Features=14599 | Samples=7 | Assays=counts
#>    .feature    .sample counts group               
#>    <chr>       <chr>    <int> <chr>               
#>  1 FBgn0000003 untrt1       0 untreated_single_end
#>  2 FBgn0000008 untrt1      92 untreated_single_end
#>  3 FBgn0000014 untrt1       5 untreated_single_end
#>  4 FBgn0000015 untrt1       0 untreated_single_end
#>  5 FBgn0000017 untrt1    4664 untreated_single_end
#>  6 FBgn0000018 untrt1     583 untreated_single_end
#>  7 FBgn0000022 untrt1       0 untreated_single_end
#>  8 FBgn0000024 untrt1      10 untreated_single_end
#>  9 FBgn0000028 untrt1       0 untreated_single_end
#> 10 FBgn0000032 untrt1    1446 untreated_single_end
#> # ℹ 40 more rows