Working with Data Frames in Julia

This tutorial will show you how to work with a data frame in Julia. A data frame is an object used to store data in Julia, and it is similar in structure to a spreadsheet. However, the data are accessed and manipulated using functions in Julia instead of by pointing and clicking in a graphic user interface (GUI). A data frame differs from an array in that it can contain multiple data types (e.g. numbers and words together) whereas an array can only contain one type (e.g. only numbers or only words).

You will learn:

  • How to create a data frame.

  • Different ways of subsetting/filtering data in a data frame.

  • How to use data frames in calculations.

  • How to add and remove values from a data frame.

Throughout this exercise I will work with Anscombe's quartet. This consists of 4 sets of similar X and Y data (here named 1, 2, 3, and 4).

3.5 ms

Creating a Data Frame

3.6 μs

Step 1: Load DataFrames package

Anytime you work with a data frame you will need to load the DataFrames package.

3.9 μs
60.0 ns
1.1 s

Step 2: Create a data frame

Create the data frame by entering arrays of values into the DataFrame() function. Each array is a separate column in the data frame. The general syntax is:

DataFrame(colname1 = [val1, val2, ...], colname2 = [val1, val2, ...], ...)

I will store the data frame in an object called anscombe.

1.4 ms

Here are the data to include:

    X_1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
        # X_1 will be the name of column 1
    Y_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
        # Y_1 will be the name of column 2
    X_2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    Y_2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 8.26, 4.74],
    X_3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0], 
    Y_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    X_4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 11.0, 8.0, 8.0, 8.0],
    Y_4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

Values were obtained from the Wikipedia article on Anscombe's quartet: Anscombe's quartet

7.6 μs
anscombe
X_1Y_1X_2Y_2X_3Y_3X_4Y_4
Float64Float64Float64Float64Float64Float64Float64Float64
1
10.0
8.04
10.0
9.14
10.0
7.46
8.0
6.58
2
8.0
6.95
8.0
8.14
8.0
6.77
8.0
5.76
3
13.0
7.58
13.0
8.74
13.0
12.74
8.0
7.71
4
9.0
8.81
9.0
8.77
9.0
7.11
8.0
8.84
5
11.0
8.33
11.0
9.26
11.0
7.81
8.0
8.47
6
14.0
9.96
14.0
8.1
14.0
8.84
8.0
7.04
7
6.0
7.24
6.0
6.13
6.0
6.08
8.0
5.25
8
4.0
4.26
4.0
3.1
4.0
5.39
11.0
12.5
9
12.0
10.84
12.0
9.13
12.0
8.15
8.0
5.56
10
7.0
4.82
7.0
8.26
7.0
6.42
8.0
7.91
11
5.0
5.68
5.0
4.74
5.0
5.73
8.0
6.89
18.3 μs

Working with Data Frames

5.6 μs

Subsetting Data

  • It is possible to access the individual values (cells) in a data frame by their (row, column) index. For example, df[1, 2] would call the value in row 1, column 2 of the data frame df.

  • It is possible to access an entire row by it's index. For example, df[1, :] would acces the entire first row of the data frame df.

  • It is similarly possible to access an entire column. For example, df[!, 2] would access column 2 in df. Columns can also be accessed by name, as you'll see below.

8.4 μs

Remember that anscombe is composed of four datasets: 1, 2, 3, and 4. Suppose we only want to use the second one (i.e. columns 3 and 4). We could subset anscombe accordingly.

2.4 μs

Subsetting Columns

3.3 μs
30.0 ns

Notice that column 3 (X_3) has been printed out. Here, 3 is the column number and the exclamation point indicates that we'd like all rows used. We could achieve the same effect by using the column's name instead of index. (In this case the column name is X_3.) I prefer to use column names when possible because it makes the code easier for humans to read.

10.0 μs
1.6 μs

or, equivalently:

1.8 μs
2.0 μs

Multiple columns can be selected by giving multiple column indices.

1.8 μs
X_2Y_2
Float64Float64
1
10.0
9.14
2
8.0
8.14
3
13.0
8.74
4
9.0
8.77
5
11.0
9.26
6
14.0
8.1
7
6.0
6.13
8
4.0
3.1
9
12.0
9.13
10
7.0
8.26
11
5.0
4.74
5.3 μs

Or, eqivalently:

1.7 μs
X_2Y_2
Float64Float64
1
10.0
9.14
2
8.0
8.14
3
13.0
8.74
4
9.0
8.77
5
11.0
9.26
6
14.0
8.1
7
6.0
6.13
8
4.0
3.1
9
12.0
9.13
10
7.0
8.26
11
5.0
4.74
4.5 μs

Subsetting Rows

Row subsetting works similarly, but a colon is used instead of an exclamation point. For example, to select row 4 from anscombe:

3.7 μs

DataFrameRow (8 columns)

X_1Y_1X_2Y_2X_3Y_3X_4Y_4
Float64Float64Float64Float64Float64Float64Float64Float64
313.07.5813.08.7413.012.748.07.71
580 ns

Calcualtions with Data Frames

1.8 μs

Subsetting can be useful when executing functions (addition, subtraction, plotting, fitting data, ..., etc.) to only a portion of the data. For example, to add 5 to every value in column 2:

1.8 μs
3.9 μs

Manipulating Data Frames

1.8 μs

Adding to an Existing Data Frame

New columns can be added to a data simply by providing a new column name. Here I'll create a column Y_1_plus5 that is equal to column Y_1 plus 5 added to each row.

3.4 μs
3.3 μs
X_1Y_1X_2Y_2X_3Y_3X_4Y_4Y_1_plus5
Float64Float64Float64Float64Float64Float64Float64Float64Float64
1
10.0
8.04
10.0
9.14
10.0
7.46
8.0
6.58
13.04
2
8.0
6.95
8.0
8.14
8.0
6.77
8.0
5.76
11.95
3
13.0
7.58
13.0
8.74
13.0
12.74
8.0
7.71
12.58
4
9.0
8.81
9.0
8.77
9.0
7.11
8.0
8.84
13.81
5
11.0
8.33
11.0
9.26
11.0
7.81
8.0
8.47
13.33
6
14.0
9.96
14.0
8.1
14.0
8.84
8.0
7.04
14.96
7
6.0
7.24
6.0
6.13
6.0
6.08
8.0
5.25
12.24
8
4.0
4.26
4.0
3.1
4.0
5.39
11.0
12.5
9.26
9
12.0
10.84
12.0
9.13
12.0
8.15
8.0
5.56
15.84
10
7.0
4.82
7.0
8.26
7.0
6.42
8.0
7.91
9.82
11
5.0
5.68
5.0
4.74
5.0
5.73
8.0
6.89
10.68
20.0 ns

Removing Data from a Data Frame

But what if I don't want that column I just added? I can select only the columns I want with select!() and leave out the rest. The exclamation point here indicates that the function will modify an pre-exisiting object (in this case the data frame anscombe).

3.1 μs
X_1Y_1X_2Y_2X_3Y_3X_4Y_4
Float64Float64Float64Float64Float64Float64Float64Float64
1
10.0
8.04
10.0
9.14
10.0
7.46
8.0
6.58
2
8.0
6.95
8.0
8.14
8.0
6.77
8.0
5.76
3
13.0
7.58
13.0
8.74
13.0
12.74
8.0
7.71
4
9.0
8.81
9.0
8.77
9.0
7.11
8.0
8.84
5
11.0
8.33
11.0
9.26
11.0
7.81
8.0
8.47
6
14.0
9.96
14.0
8.1
14.0
8.84
8.0
7.04
7
6.0
7.24
6.0
6.13
6.0
6.08
8.0
5.25
8
4.0
4.26
4.0
3.1
4.0
5.39
11.0
12.5
9
12.0
10.84
12.0
9.13
12.0
8.15
8.0
5.56
10
7.0
4.82
7.0
8.26
7.0
6.42
8.0
7.91
11
5.0
5.68
5.0
4.74
5.0
5.73
8.0
6.89
14.6 μs

Since I want to select every column but not Y_1_plus5 I can use Not(Y_1_plus5). An alternative way to do this would be select!(anscombe, [1:7]) to select columns 1 through 7 but not 8.

3.1 μs

Saving Data Frames to File

1.8 μs

Sometimes you may want to write the data frame to a file. Typically tabular data such as this are stored in plain text files with the extension CSV for comma-separated values. See the CSV tutorial for more information on how to save a data frame to a CSV file.

5.2 μs