Working with Data Frames in Julia
This tutorial will show you how to work with a data frame in Julia. A data frame is an object used to store data in Julia, and it is similar in structure to a spreadsheet. However, the data are accessed and manipulated using functions in Julia instead of by pointing and clicking in a graphic user interface (GUI). A data frame differs from an array in that it can contain multiple data types (e.g. numbers and words together) whereas an array can only contain one type (e.g. only numbers or only words).
You will learn:
How to create a data frame.
Different ways of subsetting/filtering data in a data frame.
How to use data frames in calculations.
How to add and remove values from a data frame.
Throughout this exercise I will work with Anscombe's quartet. This consists of 4 sets of similar
Creating a Data Frame
Step 1: Load DataFrames
package
Anytime you work with a data frame you will need to load the DataFrames package.
xxxxxxxxxx
# using Pkg; Pkg.add("DataFrames")
xxxxxxxxxx
using DataFrames
Step 2: Create a data frame
Create the data frame by entering arrays of values into the DataFrame()
function. Each array is a separate column in the data frame. The general syntax is:
DataFrame(colname1 = [val1, val2, ...], colname2 = [val1, val2, ...], ...)
I will store the data frame in an object called anscombe
.
Here are the data to include:
X_1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
# X_1 will be the name of column 1
Y_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
# Y_1 will be the name of column 2
X_2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
Y_2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 8.26, 4.74],
X_3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
Y_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
X_4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 11.0, 8.0, 8.0, 8.0],
Y_4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
Values were obtained from the Wikipedia article on Anscombe's quartet: Anscombe's quartet
X_1 | Y_1 | X_2 | Y_2 | X_3 | Y_3 | X_4 | Y_4 | |
---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
2 | 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
3 | 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
4 | 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
5 | 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
6 | 14.0 | 9.96 | 14.0 | 8.1 | 14.0 | 8.84 | 8.0 | 7.04 |
7 | 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
8 | 4.0 | 4.26 | 4.0 | 3.1 | 4.0 | 5.39 | 11.0 | 12.5 |
9 | 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
10 | 7.0 | 4.82 | 7.0 | 8.26 | 7.0 | 6.42 | 8.0 | 7.91 |
11 | 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
xxxxxxxxxx
anscombe = DataFrame(
X_1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
# X_1 will be the name of column 1
Y_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
# Y_1 will be the name of column 2
X_2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
Y_2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 8.26, 4.74],
X_3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
Y_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
X_4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 11.0, 8.0, 8.0, 8.0],
Y_4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
)
Working with Data Frames
Subsetting Data
It is possible to access the individual values (cells) in a data frame by their (row, column) index. For example,
df[1, 2]
would call the value in row 1, column 2 of the data framedf
.It is possible to access an entire row by it's index. For example,
df[1, :]
would acces the entire first row of the data framedf
.It is similarly possible to access an entire column. For example,
df[!, 2]
would access column 2 indf
. Columns can also be accessed by name, as you'll see below.
Remember that anscombe
is composed of four datasets: 1, 2, 3, and 4. Suppose we only want to use the second one (i.e. columns 3 and 4). We could subset anscombe
accordingly.
Subsetting Columns
10.0
8.0
13.0
9.0
11.0
14.0
6.0
4.0
12.0
7.0
5.0
xxxxxxxxxx
anscombe[!, 3]
Notice that column 3 (X_3
) has been printed out. Here, 3
is the column number and the exclamation point indicates that we'd like all rows used. We could achieve the same effect by using the column's name instead of index. (In this case the column name is X_3
.) I prefer to use column names when possible because it makes the code easier for humans to read.
10.0
8.0
13.0
9.0
11.0
14.0
6.0
4.0
12.0
7.0
5.0
xxxxxxxxxx
anscombe[!, :X_3]
or, equivalently:
10.0
8.0
13.0
9.0
11.0
14.0
6.0
4.0
12.0
7.0
5.0
xxxxxxxxxx
anscombe[!, "X_3"]
Multiple columns can be selected by giving multiple column indices.
X_2 | Y_2 | |
---|---|---|
Float64 | Float64 | |
1 | 10.0 | 9.14 |
2 | 8.0 | 8.14 |
3 | 13.0 | 8.74 |
4 | 9.0 | 8.77 |
5 | 11.0 | 9.26 |
6 | 14.0 | 8.1 |
7 | 6.0 | 6.13 |
8 | 4.0 | 3.1 |
9 | 12.0 | 9.13 |
10 | 7.0 | 8.26 |
11 | 5.0 | 4.74 |
xxxxxxxxxx
anscombe[!, [3,4]]
Or, eqivalently:
X_2 | Y_2 | |
---|---|---|
Float64 | Float64 | |
1 | 10.0 | 9.14 |
2 | 8.0 | 8.14 |
3 | 13.0 | 8.74 |
4 | 9.0 | 8.77 |
5 | 11.0 | 9.26 |
6 | 14.0 | 8.1 |
7 | 6.0 | 6.13 |
8 | 4.0 | 3.1 |
9 | 12.0 | 9.13 |
10 | 7.0 | 8.26 |
11 | 5.0 | 4.74 |
xxxxxxxxxx
anscombe[!, [:X_2, :Y_2]]
Subsetting Rows
Row subsetting works similarly, but a colon is used instead of an exclamation point. For example, to select row 4 from anscombe
:
DataFrameRow (8 columns)
X_1 | Y_1 | X_2 | Y_2 | X_3 | Y_3 | X_4 | Y_4 | |
---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
3 | 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
xxxxxxxxxx
anscombe[3,:]
Calcualtions with Data Frames
Subsetting can be useful when executing functions (addition, subtraction, plotting, fitting data, ..., etc.) to only a portion of the data. For example, to add 5 to every value in column 2:
13.04
11.95
12.58
13.81
13.33
14.96
12.24
9.26
15.84
9.82
10.68
xxxxxxxxxx
anscombe[!, 2] .+ 5
Manipulating Data Frames
Adding to an Existing Data Frame
New columns can be added to a data simply by providing a new column name. Here I'll create a column Y_1_plus5
that is equal to column Y_1
plus 5 added to each row.
13.04
11.95
12.58
13.81
13.33
14.96
12.24
9.26
15.84
9.82
10.68
xxxxxxxxxx
anscombe[!, :Y_1_plus5] = anscombe[!, 2] .+ 5
X_1 | Y_1 | X_2 | Y_2 | X_3 | Y_3 | X_4 | Y_4 | Y_1_plus5 | |
---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 | 13.04 |
2 | 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 | 11.95 |
3 | 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 | 12.58 |
4 | 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 | 13.81 |
5 | 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 | 13.33 |
6 | 14.0 | 9.96 | 14.0 | 8.1 | 14.0 | 8.84 | 8.0 | 7.04 | 14.96 |
7 | 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 | 12.24 |
8 | 4.0 | 4.26 | 4.0 | 3.1 | 4.0 | 5.39 | 11.0 | 12.5 | 9.26 |
9 | 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 | 15.84 |
10 | 7.0 | 4.82 | 7.0 | 8.26 | 7.0 | 6.42 | 8.0 | 7.91 | 9.82 |
11 | 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 | 10.68 |
xxxxxxxxxx
anscombe
Removing Data from a Data Frame
But what if I don't want that column I just added? I can select only the columns I want with select!()
and leave out the rest. The exclamation point here indicates that the function will modify an pre-exisiting object (in this case the data frame anscombe
).
X_1 | Y_1 | X_2 | Y_2 | X_3 | Y_3 | X_4 | Y_4 | |
---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
2 | 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
3 | 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
4 | 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
5 | 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
6 | 14.0 | 9.96 | 14.0 | 8.1 | 14.0 | 8.84 | 8.0 | 7.04 |
7 | 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
8 | 4.0 | 4.26 | 4.0 | 3.1 | 4.0 | 5.39 | 11.0 | 12.5 |
9 | 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
10 | 7.0 | 4.82 | 7.0 | 8.26 | 7.0 | 6.42 | 8.0 | 7.91 |
11 | 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
xxxxxxxxxx
select!(anscombe, Not(:Y_1_plus5))
Since I want to select every column but not Y_1_plus5
I can use Not(Y_1_plus5)
. An alternative way to do this would be select!(anscombe, [1:7])
to select columns 1 through 7 but not 8.
Saving Data Frames to File
Sometimes you may want to write the data frame to a file. Typically tabular data such as this are stored in plain text files with the extension CSV for comma-separated values. See the CSV tutorial for more information on how to save a data frame to a CSV file.