Dflite

Lightweight DataFrame for data science.

DFLite: The Quick & Easy DataFrame

The data frame was a concept I first came across in R, where it is a fundamental component of data analysis. Never having done much data analysis in Python, I ran into a situation where I needed a data frame but didn't know about the pandas implementation, so I went about writing my own DataFrame class. When I realized the pandas version existed, I immediately switched all my code over, only to find that for my application the pandas DataFrame was more than twice as slow as my lightweight one. So I spent some time making sure the interfaces were the same (at least in what I was doing) and made some nice 'this is what you would expect' modifications for personal use in the future.

In general, the pandas data frame performs quite well, especially with large datasets. I'm sure that this class has quite a few holes in it, but the idea of a lightweight DataFrame for Python is worth pursuing in the future. You can find the source code for dflite on GitHub, including a copy of this notebook. Here's some general usage.

Importing

Importing is easy; the only dependency is numpy.

import dflite as df

Creating a DataFrame

Usually all I want to do is create a DataFrame from a CSV file, but in code there are a couple of other ways to construct the class. The DataFrame.from_records() method is probably the most useful, creating a DataFrame from an iterable grouped by record (rows coming out of a Postgres database via psycopg2 are a good example). This is more or less equivalent to the pandas method of the same name.
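
Here's a quick sketch of from_records() (the records below are made up, and I'm assuming a columns= keyword like the pandas method of the same name):

# build a DataFrame from an iterable of row tuples,
# e.g. rows fetched from a psycopg2 cursor
records = [("2016-03-02 17:50:18", 45.10303743, -64.29103034),
           ("2016-03-02 17:50:19", 45.10291441, -64.29095464)]
points = df.DataFrame.from_records(records, columns=("Time (UTC)", "Latitude", "Longitude"))

For now, we'll demo the class with the small CSV included in the directory.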

data = df.read_csv("test.csv")
data.head()
Time (UTC) Latitude Longitude
0 2016-03-02 17:50:18 45.10303743 -64.29103034
1 2016-03-02 17:50:19 45.10291441 -64.29095464
2 2016-03-02 17:50:20 45.10279595 -64.29089237
3 2016-03-02 17:50:21 45.1026838 -64.29084603
4 2016-03-02 17:50:22 45.10259138 -64.29080328
5 2016-03-02 17:50:23 45.10251977 -64.29080362

The read_csv() function is pretty much the same as the pandas version, at least for simple usage. You can also pass in a file-like object and a driver= parameter. Currently only CSV files are supported, but they're the most common, so hey.
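
Passing a file-like object looks like this (the "csv" spelling for driver= is my assumption, though it's the only driver anyway):

# any file-like object works in place of a filename;
# driver= picks the format
with open("test.csv") as f:
    data = df.read_csv(f, driver="csv")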

The tail() method works much the same as the head() method.

data.tail(3)
Time (UTC) Latitude Longitude
0 2016-03-02 17:50:54 45.09937809 -64.29696471
1 2016-03-02 17:50:55 45.09924766 -64.2972626
2 2016-03-02 17:50:56 45.09911724 -64.29754859

Notice here how the indices are 0, 1, and 2, where they should be the last few indices of the DataFrame. The pandas.DataFrame supports (I would argue is slightly obsessed with) the idea of indexes for rows/columns. For rows I almost never have a reason to access them by anything other than an integer (certainly not a string), so I don't bother with them. Since tail() is basically just for checking the DataFrame, I didn't bother to re-number the rows (tail() is actually just shorthand for data.iloc[(len(data)-nrows):len(data)], so the result is really its own DataFrame object).
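
To make that concrete, here's the slice that tail() is shorthand for:

nrows = 3
# equivalent to data.tail(3), minus any re-numbering of rows
data.iloc[(len(data) - nrows):len(data)]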

Columns

Column names can be accessed and set just like the pandas version:

data.columns
['Time (UTC)', 'Latitude', 'Longitude']
data.columns = ("col1", "col2", "col3")
data.head()
col1 col2 col3
0 2016-03-02 17:50:18 45.10303743 -64.29103034
1 2016-03-02 17:50:19 45.10291441 -64.29095464
2 2016-03-02 17:50:20 45.10279595 -64.29089237
3 2016-03-02 17:50:21 45.1026838 -64.29084603
4 2016-03-02 17:50:22 45.10259138 -64.29080328
5 2016-03-02 17:50:23 45.10251977 -64.29080362

Column values can be accessed, added, and removed in a similar way:

data["col1"]
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
       '2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
       '2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
       '2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
       '2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
       '2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
       '2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
       '2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
       '2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
       '2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
       '2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
       '2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
       '2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'], 
      dtype='<U19')

Columns can also be accessed by index (this isn't possible in the pandas version, and I'm not quite sure why):

data[0]
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
       '2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
       '2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
       '2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
       '2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
       '2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
       '2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
       '2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
       '2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
       '2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
       '2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
       '2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
       '2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'], 
      dtype='<U19')
data["newcol"] = 10
data.head()
col1 col2 col3 newcol
0 2016-03-02 17:50:18 45.10303743 -64.29103034 10
1 2016-03-02 17:50:19 45.10291441 -64.29095464 10
2 2016-03-02 17:50:20 45.10279595 -64.29089237 10
3 2016-03-02 17:50:21 45.1026838 -64.29084603 10
4 2016-03-02 17:50:22 45.10259138 -64.29080328 10
5 2016-03-02 17:50:23 45.10251977 -64.29080362 10
data["newcol"] = data["newcol"] + 4
data.head()
col1 col2 col3 newcol
0 2016-03-02 17:50:18 45.10303743 -64.29103034 14
1 2016-03-02 17:50:19 45.10291441 -64.29095464 14
2 2016-03-02 17:50:20 45.10279595 -64.29089237 14
3 2016-03-02 17:50:21 45.1026838 -64.29084603 14
4 2016-03-02 17:50:22 45.10259138 -64.29080328 14
5 2016-03-02 17:50:23 45.10251977 -64.29080362 14
del data["newcol"]
data.head()
col1 col2 col3
0 2016-03-02 17:50:18 45.10303743 -64.29103034
1 2016-03-02 17:50:19 45.10291441 -64.29095464
2 2016-03-02 17:50:20 45.10279595 -64.29089237
3 2016-03-02 17:50:21 45.1026838 -64.29084603
4 2016-03-02 17:50:22 45.10259138 -64.29080328
5 2016-03-02 17:50:23 45.10251977 -64.29080362

Rows

In pandas, rows are accessed through the iloc attribute, so (after considerable reworking of my code) that's how it works here too. Here, data.iloc[3] will give the fourth row (as a dict-ish object), and data.iloc[3, :] will give a DataFrame with only one row. The pandas version also has a loc[] option where names can be specified, but in this implementation iloc and loc are identical, so you can pass more or less anything between the brackets and get a sensible result.

row = data.iloc[3]
row
col1 col2 col3
2016-03-02 17:50:21 45.1026838 -64.29084603

Of course, I've made a nice _repr_html_() method so it displays nicely, but each row is actually a _DFRow object, which is a subclass of dict that keeps its values in order. This means you can index it by column name or by index.

row["col1"]
'2016-03-02 17:50:21'
row[0]
'2016-03-02 17:50:21'
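
And since loc and iloc are identical in this implementation, pandas-style loc code keeps working:

# the same as data.iloc[3]
data.loc[3]["col1"]
'2016-03-02 17:50:21'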

Iterating through rows is done using the itertuples() method, which steps through the rows in the same way as the pandas version. Because pandas returns each row with the 0th item as the row number (or row index, if you believe in that kind of thing), this method does as well.

for row in data.head().itertuples():
    print(row[0], row["col2"], row[1])
0 45.10303743 2016-03-02 17:50:18
1 45.10291441 2016-03-02 17:50:19
2 45.10279595 2016-03-02 17:50:20
3 45.1026838 2016-03-02 17:50:21
4 45.10259138 2016-03-02 17:50:22
5 45.10251977 2016-03-02 17:50:23

Subsetting

Each column is a NumPy ndarray object, so it can be indexed like any other ndarray (i.e. by a list of desired rows, by a boolean ndarray, by a single index, or by a slice). Some of this notation is also available through iloc, which returns a single value (if two ints are passed), a _DFRow (if a single int is passed), or a subsetted DataFrame (if some combination of slices/ints/lists is passed). See the following examples:

data.iloc[1:3]
col1 col2 col3
0 2016-03-02 17:50:19 45.10291441 -64.29095464
1 2016-03-02 17:50:20 45.10279595 -64.29089237
data.iloc[1:5, 0:2]
col1 col2
0 2016-03-02 17:50:19 45.10291441
1 2016-03-02 17:50:20 45.10279595
2 2016-03-02 17:50:21 45.1026838
3 2016-03-02 17:50:22 45.10259138
data.iloc[[2, 5, 5, 33], ("col1", "col3")]
col1 col3
0 2016-03-02 17:50:20 -64.29089237
1 2016-03-02 17:50:23 -64.29080362
2 2016-03-02 17:50:23 -64.29080362
3 2016-03-02 17:50:51 -64.29613321
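
The single-value case (two ints) from the list above looks like this:

# row 3, column 1 gives back a single value
data.iloc[3, 1]
45.1026838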

Notice again how, in the subsetted DataFrames above, the original row numbers aren't preserved. You can work around this by making a column with your original row numbers. I get how this could be annoying, but keeping the original numbering was too complicated and wasn't necessary for what I was doing.

data["original_rows"] = list(range(len(data)))
data.iloc[[2, 5, 5, 33], ("original_rows", "col1", "col3")]
original_rows col1 col3
0 2 2016-03-02 17:50:20 -64.29089237
1 5 2016-03-02 17:50:23 -64.29080362
2 5 2016-03-02 17:50:23 -64.29080362
3 33 2016-03-02 17:50:51 -64.29613321

All of the nice indexing things we can do with NumPy are also available in the 'rows' part of the index:

data.iloc[data["col2"] > 45.1022]
col1 col2 col3 original_rows
0 2016-03-02 17:50:18 45.10303743 -64.29103034 0
1 2016-03-02 17:50:19 45.10291441 -64.29095464 1
2 2016-03-02 17:50:20 45.10279595 -64.29089237 2
3 2016-03-02 17:50:21 45.1026838 -64.29084603 3
4 2016-03-02 17:50:22 45.10259138 -64.29080328 4
5 2016-03-02 17:50:23 45.10251977 -64.29080362 5
6 2016-03-02 17:50:24 45.10245523 -64.29083152 6
7 2016-03-02 17:50:25 45.10240112 -64.29086638 7
8 2016-03-02 17:50:26 45.10233343 -64.2909347 8
9 2016-03-02 17:50:27 45.10227411 -64.29102036 9
10 2016-03-02 17:50:28 45.1022154 -64.29111654 10

Exporting

Writing the DataFrame to a CSV is probably the easiest way to export, although TSV is also supported. The to_csv() method works more or less like the pandas version, and can take a file-like object as well as a filename.
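
For example (the filename is whatever you like):

data.to_csv("track.csv")

# or write to any file-like object
import io
out = io.StringIO()
data.to_csv(out)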

Performance

As I mentioned earlier, using the pandas.DataFrame in production code that created a lot of DataFrames was quite slow. I have a feeling that there's a lot of overhead involved with the convenience of multiple indexing and built-in plotting support, and that this slows the class down when there isn't a need for it. There's also probably a lot of work that could be done on this class to add convenience without compromising performance, but I'll leave that up to some folks with a bit more spare time than I have.
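
If you're curious how the two stack up on your own workload, a rough microbenchmark sketch might look like this (the columns= keyword on the dflite side is an assumption, and this is not the benchmark behind the numbers above):

import timeit

setup = """
import dflite
import pandas
rows = [(i, float(i)) for i in range(1000)]
cols = ("a", "b")
"""

# time building 100 small DataFrames with each implementation
print(timeit.timeit("dflite.DataFrame.from_records(rows, columns=cols)",
                    setup=setup, number=100))
print(timeit.timeit("pandas.DataFrame.from_records(rows, columns=cols)",
                    setup=setup, number=100))

Cheers!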