Working with large and complex sets of data is a day-to-day reality in applied statistics. The package
dplyr provides a well-structured set of functions for manipulating such data collections and performing typical operations with a standard syntax that makes them easier to remember. It is also very fast, even with large collections. To increase its applicability, the functions work with connections to databases as well as data.frames. dplyr builds on plyr and incorporates features of data.table, which is known for being fast and efficient in handling large datasets.
To illustrate these properties we'll use the flights data that we're already familiar with.
There are roughly a quarter of a million records and 21 variables, which is a good size. dplyr works fine with data.frames like this, but converting it to a
tbl_df object gives a nice summary view of the data:
It prints only as much sample data as fits the window size.
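A minimal sketch of the conversion, assuming a current dplyr in which as_tibble() has taken over from the older tbl_df() constructor. The small data frame is a hypothetical stand-in for the flights data, and its column names are assumptions chosen to look like hflights.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights_df <- data.frame(
  Year = 2011L,
  Month = c(1L, 1L, 2L),
  DayofMonth = c(1L, 2L, 1L),
  UniqueCarrier = c("AA", "AA", "CO"),
  ArrDelay = c(-10, 5, 12)
)

# Convert to a tibble; printing shows dimensions, column types and
# only as many rows and columns as fit the console
flights <- as_tibble(flights_df)
print(flights)
```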
Basic manipulations of data
Much work with data involves subsetting, defining new columns, sorting or otherwise manipulating the data. dplyr has five functions (verbs) for such actions, all of which start with a data.frame or
tbl_df and produce another one.
Here we got the January flights for AA. This is like
subset but the syntax is a little different. We don't need &; comma-separated conditions are combined with 'and' automatically. For an 'or' you add | explicitly.
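The verb described here appears to be filter(). The following is a sketch on a toy table, not the original call; the columns Month and UniqueCarrier are assumed names.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  Month = c(1L, 1L, 2L, 1L, 1L),
  UniqueCarrier = c("AA", "CO", "AA", "AA", "UA"),
  ArrDelay = c(-10, 5, 12, 30, 0)
)

# Comma-separated conditions are combined with a logical 'and'
jan_aa <- filter(flights, Month == 1, UniqueCarrier == "AA")

# An 'or' has to be written out with |
aa_or_co <- filter(flights, UniqueCarrier == "AA" | UniqueCarrier == "CO")

print(jan_aa)
print(aa_or_co)
```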
This function reorders the data based on specified columns.
This could be done with
order but the syntax is much harder.
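The reordering verb is presumably arrange(). A small sketch on assumed column names, with the base R order() version alongside for comparison:

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  Month = c(2L, 1L, 1L),
  DayofMonth = c(1L, 2L, 1L),
  ArrDelay = c(12, 5, -10)
)

# Sort by month, then by day within month
by_date <- arrange(flights, Month, DayofMonth)

# desc() sorts a column in descending order
by_delay <- arrange(flights, desc(ArrDelay))

# The same date ordering with base R's order()
base_sorted <- flights[order(flights$Month, flights$DayofMonth), ]
```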
This works like the select option to subset.
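A sketch of column selection with select(), which is presumably the verb meant here. The column names are assumptions.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  Year = 2011L,
  Month = c(1L, 2L),
  DayofMonth = c(1L, 1L),
  UniqueCarrier = c("AA", "CO"),
  ArrDelay = c(-10, 12)
)

# Pick columns by name
picked <- select(flights, Year, Month, ArrDelay)

# Pick a contiguous range of columns
date_cols <- select(flights, Year:DayofMonth)

# Drop a column with a minus sign
no_delay <- select(flights, -ArrDelay)
```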
This adds new columns, often computed from existing ones. You can also refer to new columns you just created in the same call.
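Assuming the verb in question is mutate(), this sketch shows a second new column reusing the first one within a single call. The column names and the "gain" calculation are illustrative only.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  ArrDelay = c(10, 30),
  DepDelay = c(4, 10),
  AirTime = c(60, 120)  # minutes
)

with_gain <- mutate(
  flights,
  gain = ArrDelay - DepDelay,           # new column from existing ones
  gain_per_hour = gain / (AirTime / 60) # uses the column created just above
)

print(with_gain)
```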
This produces a summary statistic, which when computed on the un-grouped data isn't very interesting.
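A sketch of summarise() on ungrouped data, using an assumed ArrDelay column. The whole table collapses to a single row, which is why it is of limited interest until the data are grouped.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(ArrDelay = c(-10, 5, NA, 20))

# One row out, regardless of how many rows go in
overall <- summarise(flights, mean_delay = mean(ArrDelay, na.rm = TRUE))

print(overall)
```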
A major strength of dplyr is the ability to group the data by a variable or variables and then operate on the data 'by group'. With plyr you can do much the same using the
ddply function or one of its relatives, such as
daply. However, there are advantages to having grouped data as an object in its own right.
Problem: Compute mean arrival delay by plane, along with other useful data.
The dplyr way to do this is as follows.
First create a version of the data grouped by plane.
Printing it shows all the data but indicates the grouping.
The information we want is summary statistics by plane. Just use the summarise function on the grouped data.
This gives us nice summary statistics per plane. The syntax is easier to understand and it's faster.
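The two steps above can be sketched as follows. This assumes the plane is identified by a TailNum column and uses made-up values, so the choice of summary columns is illustrative rather than the original code.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  TailNum = c("N1", "N1", "N2", "N2", "N2"),
  ArrDelay = c(10, 20, -5, NA, 5),
  Distance = c(100, 200, 300, 300, 300)
)

# Step 1: a grouped version of the data, one group per plane
planes <- group_by(flights, TailNum)

# Step 2: summary statistics computed within each group
delay <- summarise(
  planes,
  count = n(),                            # flights per plane
  dist = mean(Distance, na.rm = TRUE),    # mean distance flown
  delay = mean(ArrDelay, na.rm = TRUE)    # mean arrival delay
)

print(delay)
```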
n() is one of several aggregate functions that are useful to employ with
summarise on grouped data. Besides the typical ones like
max, etc., there are also
Grouping by multiple variables
When we do this we have the ability to easily compute summary stats by different combinations of the grouping variables.
Suppose we group the data into daily flights.
We have access to each of the grouping variables. Notice that in the summary data.frame, we have Year and Month as grouping variables. We can get the number of flights per month by summarizing as follows.
Now the only grouping variable is year. We backed out of the grouping variables one level of granularity at a time. This is fine for counts and sums, but it would not work for variances, for example; those need to be computed on the raw variables.
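A sketch of this roll-up, assuming grouping columns named Year, Month and DayofMonth and a dplyr version (1.0 or later) that accepts the .groups argument. Each summarise() call removes the innermost grouping variable.

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  Year = 2011L,
  Month = c(1L, 1L, 1L, 2L, 2L),
  DayofMonth = c(1L, 1L, 2L, 1L, 1L)
)

daily <- group_by(flights, Year, Month, DayofMonth)

# Flights per day; the result is still grouped by Year and Month
per_day <- summarise(daily, n_flights = n(), .groups = "drop_last")

# Flights per month; the result is still grouped by Year
per_month <- summarise(per_day, n_flights = sum(n_flights), .groups = "drop_last")

# Flights per year; no grouping is left
per_year <- summarise(per_month, n_flights = sum(n_flights), .groups = "drop")

print(group_vars(per_day))
print(per_month)
print(per_year)
```

Summing the daily counts works because a count is additive across groups; a variance computed from per-day variances would not be.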
There is a nice way to pass the result of one function to another. This is possible because so many
dplyr functions take a data table as input and output another data table.
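In current dplyr this chaining is usually written with the pipe operator %>%, which passes the result on its left as the first argument of the function on its right. A sketch with assumed column names:

```r
library(dplyr)

# Toy stand-in for the flights data (hypothetical values and column names)
flights <- tibble(
  TailNum = c("N1", "N1", "N2", "N2", "N2", "N3"),
  ArrDelay = c(10, 20, -5, NA, 5, 40)
)

# Each step receives the table produced by the previous step
result <- flights %>%
  group_by(TailNum) %>%
  summarise(count = n(), delay = mean(ArrDelay, na.rm = TRUE)) %>%
  filter(count > 1) %>%
  arrange(desc(delay))

print(result)
```

Read top to bottom, the chain follows the order in which the operations are applied, which avoids nesting calls inside one another.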
Working with databases
dplyr has been written to work with data.frames and connections to remote databases in a variety of formats. This permits handling very large amounts of data with a standard syntax.
Here we'll do an example of working with an SQLite database.
dplyr contains all we need to set up a sample database on disk and connect to it.
This is a database connection, although there is nothing in it yet. Now we'll copy a bunch of flight data into it.
This copies the hflights data.frame and creates indices on the day, carrier and tail number to aid searching on these variables. hflights_sqlite is a table object that behaves like a data.frame but is connected to the SQLite database created on disk.
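A sketch of the connect-and-copy step using the DBI-based interface of recent dplyr versions, which may differ from the calls originally shown. It assumes the DBI, RSQLite and dbplyr packages are installed, uses an in-memory database instead of a file on disk, and copies a small hypothetical table instead of hflights.

```r
library(dplyr)

# In-memory SQLite database (assumption: the text uses a file on disk)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-in for the flights data (hypothetical values and column names)
flights_df <- data.frame(
  DayofMonth = c(1L, 2L, 1L),
  UniqueCarrier = c("AA", "CO", "AA"),
  TailNum = c("N1", "N2", "N1"),
  ArrDelay = c(10, 30, -5),
  stringsAsFactors = FALSE
)

# Copy the data into the database and index the columns used for searching
flights_sqlite <- copy_to(
  con, flights_df,
  name = "flights",
  temporary = FALSE,
  indexes = list("DayofMonth", "UniqueCarrier", "TailNum")
)

tables <- DBI::dbListTables(con)
print(tables)

DBI::dbDisconnect(con)
```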
The basic verbs for manipulating and transforming data tables operate the same way.
R only reaches into the database when absolutely necessary.
- It never pulls data back to R unless you explicitly ask for it.
- It delays doing any work until the last possible minute, collecting together everything you want to do then sending that to the database in one step.
All of this happens in R, on table objects inside the R session; no calls are made to the SQLite database until we require c4 to be printed.
This only pulled out 10 rows. Notice that it retains reference to the chain of operations that created it; it looks like more than a table.
The component that may be most informative is query.
This is the SQL code that is actually executed on the database.
To tell R to complete this call to the database and download all rows we use the collect command.
This has lost the SQLite feature; it is just a data.frame table.
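The lazy workflow described in this section can be sketched end to end as follows. The steps c1 to c4 are illustrative rather than the original chain, the table is a small hypothetical stand-in, and the DBI, RSQLite and dbplyr packages are assumed to be installed. In recent dplyr versions the generated SQL is inspected with show_query().

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-in for the flights data (hypothetical values and column names)
flights_df <- data.frame(
  Year = 2011L,
  Month = c(1L, 1L, 2L),
  DayofMonth = c(1L, 2L, 1L),
  UniqueCarrier = c("AA", "CO", "AA"),
  DepDelay = c(4, 10, -2),
  ArrDelay = c(10, 30, -5),
  stringsAsFactors = FALSE
)
flights_sqlite <- copy_to(con, flights_df, name = "flights")

# None of these steps runs a query; each only records what to do
c1 <- filter(flights_sqlite, DepDelay > 0)
c2 <- select(c1, Year, Month, DayofMonth, UniqueCarrier, DepDelay, ArrDelay)
c3 <- mutate(c2, gain = ArrDelay - DepDelay)
c4 <- arrange(c3, Year, Month, DayofMonth)

# Show the SQL that would be sent to the database
show_query(c4)

# Run the query and download every row into an ordinary local table
local_result <- collect(c4)
print(local_result)

DBI::dbDisconnect(con)
```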