The data frame was a concept I first came across in R, where it is a fundamental component of data analysis. Never having done much data analysis in Python, I came across a situation where I needed a data frame but didn't know about the `pandas` implementation, so I went about writing my own `DataFrame` class. When I realized the `pandas` version existed, I immediately switched all my code over, only to find that for my application, the `pandas` `DataFrame` was more than twice as slow as my lightweight `DataFrame`. So I spent some time making sure the interfaces were the same (at least for what I was doing) and made some nice 'this is what you would expect' modifications for personal use in the future.
In general, the `pandas` data frame performs quite well, especially with large datasets. I'm sure that this class has quite a few holes in it, but the idea of a lightweight `DataFrame` for Python is worth pursuing in the future. You can find the source code for `dflite` on GitHub, including a copy of this notebook. Here's some general usage.
Importing is easy; the only dependency is `numpy`.
```python
import dflite as df
```
Usually all I want to do is create a `DataFrame` from a CSV file, but in code there are a couple of other ways to construct the class. The `DataFrame.from_records()` method is probably the most useful, creating a `DataFrame` from an iterable grouped by record (items coming out of a Postgres database via `psycopg2` are a good example). This is more or less equivalent to the `pandas` method of the same name.
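Here's a minimal sketch of what that construction path might look like (the `columns=` keyword is an assumption borrowed from the `pandas` method it mirrors; check the `dflite` source for the exact signature):

```python
# Hypothetical from_records() usage; records are grouped by row,
# e.g. tuples coming out of a psycopg2 cursor.
records = [
    ("2016-03-02 17:50:18", 45.10303743, -64.29103034),
    ("2016-03-02 17:50:19", 45.10291441, -64.29095464),
]
data = df.DataFrame.from_records(records, columns=("Time (UTC)", "Latitude", "Longitude"))
```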
For now, we'll demo the class with the small CSV included in the directory.
```python
data = df.read_csv("test.csv")
data.head()
```
|   | Time (UTC) | Latitude | Longitude |
|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 |
The `read_csv()` function is pretty much the same as the `pandas` version, at least for simple usage. You can also pass in a file-like object and a `driver=` parameter. Currently only CSV files are supported, but they're the most common, so hey.
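To sketch the file-like usage described above (the `driver="csv"` value is an assumption; CSV is the only driver mentioned, so the parameter may well have a sensible default):

```python
# Passing a file-like object instead of a filename; the driver=
# value here is assumed, since only CSV is currently supported.
with open("test.csv") as f:
    data = df.read_csv(f, driver="csv")
```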
The `tail()` method works much the same as the `head()` method.
```python
data.tail(3)
```
|   | Time (UTC) | Latitude | Longitude |
|---|---|---|---|
| 0 | 2016-03-02 17:50:54 | 45.09937809 | -64.29696471 |
| 1 | 2016-03-02 17:50:55 | 45.09924766 | -64.2972626 |
| 2 | 2016-03-02 17:50:56 | 45.09911724 | -64.29754859 |
Notice here how the indices are 0, 1, and 2, where they should be the last few indices of the `DataFrame`. The `pandas.DataFrame` supports (I would argue is slightly obsessed with) the idea of indexes for rows/columns. For rows, I almost never have a reason to access them by anything other than an integer (certainly not a string), and because of this I don't bother with them. Since `tail()` is basically just for checking the `DataFrame`, I didn't bother to re-number the rows (`tail()` is actually just shorthand for `data.iloc[(len(data) - nrows):len(data)]`, so it's really its own `DataFrame` object).
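You can see that equivalence directly (using `nrows=3` to match the call above):

```python
# Equivalent to data.tail(3), per the shorthand described above
data.iloc[(len(data) - 3):len(data)]
```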
Column names can be accessed and set just like the `pandas` version:
```python
data.columns
```

```
['Time (UTC)', 'Latitude', 'Longitude']
```
```python
data.columns = ("col1", "col2", "col3")
data.head()
```
|   | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 |
Column values can be added and removed in a similar way:
data["col1"]
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
'2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
'2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
'2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
'2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
'2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
'2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
'2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
'2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
'2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
'2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
'2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
'2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'],
dtype='<U19')
Columns can also be accessed by index (this isn't possible in the `pandas` version, and I'm not quite sure why):
```python
data[0]
```

```
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
       '2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
       '2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
       '2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
       '2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
       '2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
       '2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
       '2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
       '2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
       '2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
       '2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
       '2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
       '2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'],
      dtype='<U19')
```
data["newcol"] = 10
data.head()
|   | col1 | col2 | col3 | newcol |
|---|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 | 10 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 | 10 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 | 10 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 | 10 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 | 10 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 | 10 |
data["newcol"] = data["newcol"] + 4
data.head()
|   | col1 | col2 | col3 | newcol |
|---|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 | 14 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 | 14 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 | 14 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 | 14 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 | 14 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 | 14 |
del data["newcol"]
data.head()
|   | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 |
In `pandas`, rows are accessed through the `iloc` attribute, so after considerable reworking of my code, mine are too. Here, `data.iloc[3]` will give the fourth row (as a `dict`-ish object), and `data.iloc[3, :]` will give a `DataFrame` with only one row. The `pandas` version also has a `loc[]` option where names can be specified, but in this implementation `iloc` and `loc` are identical, so you can pass more or less anything between the brackets and get a sensible result.
```python
row = data.iloc[3]
row
```
| col1 | col2 | col3 |
|---|---|---|
| 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 |
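For contrast, and as described above, passing a slice for the columns yields a one-row `DataFrame` rather than a `_DFRow`:

```python
# A DataFrame with a single row, per the iloc notes above
one_row = data.iloc[3, :]
```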
Of course, I've made the nice `_repr_html_()` method so it displays nicely, but each row is actually a `_DFRow` object, which is a subclass of `dict` that keeps its values in order. This means you can index it by column name or by integer index.
row["col1"]
'2016-03-02 17:50:21'
row[0]
'2016-03-02 17:50:21'
Iterating through rows is done using the `itertuples()` method, which returns an iterator over the rows in the same way as the `pandas` version. Because `pandas` returns its rows with the 0th item as the row number (or row index, if you believe in that kind of thing), this method does as well.
```python
for row in data.head().itertuples():
    print(row[0], row["col2"], row[1])
```

```
0 45.10303743 2016-03-02 17:50:18
1 45.10291441 2016-03-02 17:50:19
2 45.10279595 2016-03-02 17:50:20
3 45.1026838 2016-03-02 17:50:21
4 45.10259138 2016-03-02 17:50:22
5 45.10251977 2016-03-02 17:50:23
```
Each column is a NumPy `ndarray` object, so it can be indexed like any other `ndarray` (i.e. by a `list` of desired rows, by an `ndarray` of logicals, by a single index, or by a `slice`).
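Since each column really is a bare `ndarray`, plain NumPy indexing works on it directly (a quick sketch using the renamed columns from above):

```python
col2 = data["col2"]   # a plain numpy ndarray of latitudes
col2[[0, 2, 4]]       # a list of desired rows
col2[col2 > 45.102]   # an ndarray of logicals (a boolean mask)
col2[0]               # a single index
col2[1:4]             # a slice
```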
Some of this notation is available through the `iloc` attribute as well, which returns a single value (if two `int`s are passed), a `_DFRow` (if only a single integer is passed), or a subsetted `DataFrame` (if some combination of slices/ints/lists is passed). See the following examples:
```python
data.iloc[1:3]
```
|   | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 |
| 1 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 |
```python
data.iloc[1:5, 0:2]
```
|   | col1 | col2 |
|---|---|---|
| 0 | 2016-03-02 17:50:19 | 45.10291441 |
| 1 | 2016-03-02 17:50:20 | 45.10279595 |
| 2 | 2016-03-02 17:50:21 | 45.1026838 |
| 3 | 2016-03-02 17:50:22 | 45.10259138 |
```python
data.iloc[[2, 5, 5, 33], ("col1", "col3")]
```
|   | col1 | col3 |
|---|---|---|
| 0 | 2016-03-02 17:50:20 | -64.29089237 |
| 1 | 2016-03-02 17:50:23 | -64.29080362 |
| 2 | 2016-03-02 17:50:23 | -64.29080362 |
| 3 | 2016-03-02 17:50:51 | -64.29613321 |
Notice again how our original row numbers aren't preserved. You can work around this by making a column with your original row numbers. I get how this could be annoying, but including it was too complicated and wasn't necessary for what I was doing.
data["original_rows"] = list(range(len(data)))
data.iloc[[2, 5, 5, 33], ("original_rows", "col1", "col3")]
|   | original_rows | col1 | col3 |
|---|---|---|---|
| 0 | 2 | 2016-03-02 17:50:20 | -64.29089237 |
| 1 | 5 | 2016-03-02 17:50:23 | -64.29080362 |
| 2 | 5 | 2016-03-02 17:50:23 | -64.29080362 |
| 3 | 33 | 2016-03-02 17:50:51 | -64.29613321 |
All of the nice indexing things we can do with NumPy are also available in the 'rows' part of the index:
data.iloc[data["col2"] > 45.1022]
|   | col1 | col2 | col3 | original_rows |
|---|---|---|---|---|
| 0 | 2016-03-02 17:50:18 | 45.10303743 | -64.29103034 | 0 |
| 1 | 2016-03-02 17:50:19 | 45.10291441 | -64.29095464 | 1 |
| 2 | 2016-03-02 17:50:20 | 45.10279595 | -64.29089237 | 2 |
| 3 | 2016-03-02 17:50:21 | 45.1026838 | -64.29084603 | 3 |
| 4 | 2016-03-02 17:50:22 | 45.10259138 | -64.29080328 | 4 |
| 5 | 2016-03-02 17:50:23 | 45.10251977 | -64.29080362 | 5 |
| 6 | 2016-03-02 17:50:24 | 45.10245523 | -64.29083152 | 6 |
| 7 | 2016-03-02 17:50:25 | 45.10240112 | -64.29086638 | 7 |
| 8 | 2016-03-02 17:50:26 | 45.10233343 | -64.2909347 | 8 |
| 9 | 2016-03-02 17:50:27 | 45.10227411 | -64.29102036 | 9 |
| 10 | 2016-03-02 17:50:28 | 45.1022154 | -64.29111654 | 10 |
Writing the `DataFrame` to a CSV is probably the easiest way to export, although TSV is also supported. The `to_csv()` method works more or less like the `pandas` version, and can take a file-like object as well as a filename.
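For example (grounded in the description above; the output filenames are arbitrary):

```python
# Write by filename, or pass a file-like object
data.to_csv("test_out.csv")
with open("test_out2.csv", "w") as f:
    data.to_csv(f)
```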
As I mentioned earlier, running the `pandas.DataFrame` in production code that used quite a lot of `DataFrame`s was quite slow. I have a feeling that there's a lot of overhead involved with the convenience of multiple indexing and built-in plotting support that slows the class down when there isn't a need for it. There's also probably a lot of work to be done on this class to add convenience without compromising performance, but I'll leave that up to some folks with a bit more spare time than I have. Cheers!