Working with Data Frames in Julia

This tutorial will show you how to work with a data frame in Julia. A data frame is an object used to store data in Julia, and it is similar in structure to a spreadsheet. However, the data are accessed and manipulated using functions in Julia instead of by pointing and clicking in a graphic user interface (GUI). A data frame differs from an array in that it can contain multiple data types (e.g. numbers and words together) whereas an array can only contain one type (e.g. only numbers or only words).

You will learn:

How to create a data frame.
Different ways of subsetting/filtering data in a data frame.
How to use data frames in calculations.
How to add and remove values from a data frame.

Throughout this exercise I will work with Anscombe's quartet. This consists of 4 sets of similar $X$ and $Y$ data (here named 1, 2, 3, and 4).

3.5 ms

Creating a Data Frame

3.6 μs

Step 1: Load `DataFrames` package

Anytime you work with a data frame you will need to load the DataFrames package.

3.9 μs

xxxxxxxxxx
 
# using Pkg; Pkg.add("DataFrames")

60.0 ns

xxxxxxxxxx
 
using DataFrames

1.1 s

Step 2: Create a data frame

Create the data frame by entering arrays of values into the DataFrame() function. Each array is a separate column in the data frame. The general syntax is:

DataFrame(colname1 = [val1, val2, ...], colname2 = [val1, val2, ...], ...)

I will store the data frame in an object called anscombe.

1.4 ms

Here are the data to include:

    X_1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
        # X_1 will be the name of column 1
    Y_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
        # Y_1 will be the name of column 2
    X_2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    Y_2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 8.26, 4.74],
    X_3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
    Y_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    X_4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 11.0, 8.0, 8.0, 8.0],
    Y_4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

Values were obtained from the Wikipedia article on Anscombe's quartet: Anscombe's quartet

7.6 μs

anscombe

	X_1	Y_1	X_2	Y_2	X_3	Y_3	X_4	Y_4
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
2	8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
3	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
4	9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
5	11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
6	14.0	9.96	14.0	8.1	14.0	8.84	8.0	7.04
7	6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
8	4.0	4.26	4.0	3.1	4.0	5.39	11.0	12.5
9	12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
10	7.0	4.82	7.0	8.26	7.0	6.42	8.0	7.91
11	5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

xxxxxxxxxx
 
anscombe = DataFrame(
    X_1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
        # X_1 will be the name of column 1
    Y_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
        # Y_1 will be the name of column 2
    X_2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    Y_2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 8.26, 4.74],
    X_3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
    Y_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    X_4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 11.0, 8.0, 8.0, 8.0],
    Y_4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
    )

18.3 μs

Working with Data Frames

5.6 μs

Subsetting Data

It is possible to access the individual values (cells) in a data frame by their (row, column) index. For example, df[1, 2] would call the value in row 1, column 2 of the data frame df.
It is possible to access an entire row by it's index. For example, df[1, :] would acces the entire first row of the data frame df.
It is similarly possible to access an entire column. For example, df[!, 2] would access column 2 in df. Columns can also be accessed by name, as you'll see below.

8.4 μs

Remember that anscombe is composed of four datasets: 1, 2, 3, and 4. Suppose we only want to use the second one (i.e. columns 3 and 4). We could subset anscombe accordingly.

2.4 μs

Subsetting Columns

3.3 μs

Float641

10.0

8.0

13.0

9.0

11.0

14.0

6.0

4.0

12.0

7.0

5.0

xxxxxxxxxx
 
anscombe[!, 3]

30.0 ns

Notice that column 3 (X_3) has been printed out. Here, 3 is the column number and the exclamation point indicates that we'd like all rows used. We could achieve the same effect by using the column's name instead of index. (In this case the column name is X_3.) I prefer to use column names when possible because it makes the code easier for humans to read.

10.0 μs

Float641

10.0

8.0

13.0

9.0

11.0

14.0

6.0

4.0

12.0

7.0

5.0

xxxxxxxxxx
 
anscombe[!, :X_3]

1.6 μs

or, equivalently:

1.8 μs

Float641

10.0

8.0

13.0

9.0

11.0

14.0

6.0

4.0

12.0

7.0

5.0

xxxxxxxxxx
 
anscombe[!, "X_3"]

2.0 μs

Multiple columns can be selected by giving multiple column indices.

1.8 μs

	X_2	Y_2
	Float64	Float64
1	10.0	9.14
2	8.0	8.14
3	13.0	8.74
4	9.0	8.77
5	11.0	9.26
6	14.0	8.1
7	6.0	6.13
8	4.0	3.1
9	12.0	9.13
10	7.0	8.26
11	5.0	4.74

xxxxxxxxxx
 
anscombe[!, [3,4]]

5.3 μs

Or, eqivalently:

1.7 μs

	X_2	Y_2
	Float64	Float64
1	10.0	9.14
2	8.0	8.14
3	13.0	8.74
4	9.0	8.77
5	11.0	9.26
6	14.0	8.1
7	6.0	6.13
8	4.0	3.1
9	12.0	9.13
10	7.0	8.26
11	5.0	4.74

xxxxxxxxxx
 
anscombe[!, [:X_2, :Y_2]]

4.5 μs

Subsetting Rows

Row subsetting works similarly, but a colon is used instead of an exclamation point. For example, to select row 4 from anscombe:

3.7 μs

DataFrameRow (8 columns)

	X_1	Y_1	X_2	Y_2	X_3	Y_3	X_4	Y_4
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
3	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71

xxxxxxxxxx
 
anscombe[3,:]

580 ns

Calcualtions with Data Frames

1.8 μs

Subsetting can be useful when executing functions (addition, subtraction, plotting, fitting data, ..., etc.) to only a portion of the data. For example, to add 5 to every value in column 2:

1.8 μs

Float641

13.04

11.95

12.58

13.81

13.33

14.96

12.24

9.26

15.84

9.82

10.68

xxxxxxxxxx
 
anscombe[!, 2] .+ 5

3.9 μs

Manipulating Data Frames

1.8 μs

Adding to an Existing Data Frame

New columns can be added to a data simply by providing a new column name. Here I'll create a column Y_1_plus5 that is equal to column Y_1 plus 5 added to each row.

3.4 μs

Float641

13.04

11.95

12.58

13.81

13.33

14.96

12.24

9.26

15.84

9.82

10.68

xxxxxxxxxx
 
anscombe[!, :Y_1_plus5] = anscombe[!, 2] .+ 5

3.3 μs

	X_1	Y_1	X_2	Y_2	X_3	Y_3	X_4	Y_4	Y_1_plus5
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58	13.04
2	8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76	11.95
3	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71	12.58
4	9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84	13.81
5	11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47	13.33
6	14.0	9.96	14.0	8.1	14.0	8.84	8.0	7.04	14.96
7	6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25	12.24
8	4.0	4.26	4.0	3.1	4.0	5.39	11.0	12.5	9.26
9	12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56	15.84
10	7.0	4.82	7.0	8.26	7.0	6.42	8.0	7.91	9.82
11	5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89	10.68

xxxxxxxxxx
 
anscombe

20.0 ns

Removing Data from a Data Frame

But what if I don't want that column I just added? I can select only the columns I want with select!() and leave out the rest. The exclamation point here indicates that the function will modify an pre-exisiting object (in this case the data frame anscombe).

3.1 μs

	X_1	Y_1	X_2	Y_2	X_3	Y_3	X_4	Y_4
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
2	8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
3	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
4	9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
5	11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
6	14.0	9.96	14.0	8.1	14.0	8.84	8.0	7.04
7	6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
8	4.0	4.26	4.0	3.1	4.0	5.39	11.0	12.5
9	12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
10	7.0	4.82	7.0	8.26	7.0	6.42	8.0	7.91
11	5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

xxxxxxxxxx
 
select!(anscombe, Not(:Y_1_plus5))

14.6 μs

Since I want to select every column but not Y_1_plus5 I can use Not(Y_1_plus5). An alternative way to do this would be select!(anscombe, [1:7]) to select columns 1 through 7 but not 8.

3.1 μs

Saving Data Frames to File

1.8 μs

Sometimes you may want to write the data frame to a file. Typically tabular data such as this are stored in plain text files with the extension CSV for comma-separated values. See the CSV tutorial for more information on how to save a data frame to a CSV file.

5.2 μs

Working with Data Frames in Julia

Creating a Data Frame

Step 1: Load DataFrames package

Step 2: Create a data frame

Working with Data Frames

Subsetting Data

Calcualtions with Data Frames

Manipulating Data Frames

Saving Data Frames to File

Step 1: Load `DataFrames` package