Empirical
chrisaycock (24)

Empirical is a language for time-series analysis. It has built-in Dataframes (tables) and integrated queries. It's fully interactive and already works on Repl.it.

Empirical version 0.6.5
Copyright (C) 2019--2020 Empirical Software Solutions, LLC
Type '?' for help. Type '\help' for magic commands.

>>> let trades = load("trades.csv")

>>> trades
 symbol                  timestamp    price size
   AAPL 2019-05-01 09:30:00.578802 210.5200  780
   AAPL 2019-05-01 09:30:00.580485 210.8100  390
    BAC 2019-05-01 09:30:00.629205  30.2500  510
    CVX 2019-05-01 09:30:00.944122 117.8000 5860
   AAPL 2019-05-01 09:30:01.002405 211.1300  320
   AAPL 2019-05-01 09:30:01.066917 211.1186  310
   AAPL 2019-05-01 09:30:01.118968 211.0000  730
    BAC 2019-05-01 09:30:01.186416  30.2450  380
    CVX 2019-05-01 09:30:01.639577 118.2550 2880
    ...                        ...      ...  ...

Empirical is a normal language with variables, types, functions, etc. Dataframes are just values, but there is plenty of syntactic sugar for using them. For example, here is a simple aggregation on stock trades:

>>> from trades select volume = sum(size) by symbol
 symbol volume
   AAPL 135760
    BAC 223590
    CVX 507580

I can run any expression, including user-defined functions. This computes the weighted average (wavg) given a set of weights (ws) and values (vs):

>>> func wavg(ws, vs) = sum(ws * vs) / sum(ws)

Now I can compute the volume-weighted average price (VWAP), a common metric in finance. I'm going to do this for every five minutes (5m).

>>> from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)
 symbol           timestamp       vwap
   AAPL 2019-05-01 09:30:00 210.305724
    BAC 2019-05-01 09:30:00  30.483875
    CVX 2019-05-01 09:30:00 119.427733
   AAPL 2019-05-01 09:35:00 202.972440
    BAC 2019-05-01 09:35:00  30.848397
    CVX 2019-05-01 09:35:00 119.431601
   AAPL 2019-05-01 09:40:00 204.671388
    BAC 2019-05-01 09:40:00  30.217362
    CVX 2019-05-01 09:40:00 117.224763
    ...                 ...        ...
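For readers who think in Python, the query above can be approximated as follows. This is only a sketch of the semantics, not how Empirical implements it: `vwap_by_bar` is a hypothetical helper, and `bar` rounding a timestamp down to the start of its interval is an assumption based on the 09:30:00 and 09:35:00 labels in the output.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def wavg(ws, vs):
    """Weighted average of the values vs using the weights ws."""
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)

def bar(ts, width):
    """Round a timestamp down to the start of its bar (assumed behavior)."""
    epoch = datetime(1970, 1, 1)
    return ts - (ts - epoch) % width

def vwap_by_bar(trades, width):
    """Group trades by (symbol, bar) and compute wavg(size, price) per group."""
    groups = defaultdict(lambda: ([], []))
    for symbol, ts, price, size in trades:
        sizes, prices = groups[(symbol, bar(ts, width))]
        sizes.append(size)
        prices.append(price)
    return {key: wavg(sizes, prices) for key, (sizes, prices) in groups.items()}

# The first few rows of the trades table shown earlier.
trades = [
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 578802), 210.52, 780),
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 580485), 210.81, 390),
    ("BAC", datetime(2019, 5, 1, 9, 30, 0, 629205), 30.25, 510),
]
result = vwap_by_bar(trades, timedelta(minutes=5))
```

The Empirical version is a single line because grouping, bucketing, and the per-group call are all part of the query syntax.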

I can also perform joins across time series. Let's take a look at some quotes (bid and ask). Notice how these timestamps don't line up with the trades; a stock exchange will change its quotes constantly before a trade occurs.

>>> let quotes = load("quotes.csv")

>>> quotes
 symbol                  timestamp     bid    ask
   AAPL 2019-05-01 09:30:00.410166 210.450 211.02
   AAPL 2019-05-01 09:30:00.491133 210.800 211.15
    CVX 2019-05-01 09:30:00.544263 117.760 118.34
    BAC 2019-05-01 09:30:00.585684  30.240  30.27
    CVX 2019-05-01 09:30:01.096591 118.220 118.54
   AAPL 2019-05-01 09:30:01.131702 210.925 211.20
   AAPL 2019-05-01 09:30:01.185615 210.980 211.21
   AAPL 2019-05-01 09:30:01.349968 210.730 211.34
   AAPL 2019-05-01 09:30:01.404082 211.150 211.40
    ...                        ...     ...    ...

Empirical can line up timestamps automatically. Here I ask for the latest quote for each trade:

>>> join trades, quotes on symbol asof timestamp
 symbol                  timestamp    price size    bid    ask
   AAPL 2019-05-01 09:30:00.578802 210.5200  780 210.80 211.15
   AAPL 2019-05-01 09:30:00.580485 210.8100  390 210.80 211.15
    BAC 2019-05-01 09:30:00.629205  30.2500  510  30.24  30.27
    CVX 2019-05-01 09:30:00.944122 117.8000 5860 117.76 118.34
   AAPL 2019-05-01 09:30:01.002405 211.1300  320 210.80 211.15
   AAPL 2019-05-01 09:30:01.066917 211.1186  310 210.80 211.15
   AAPL 2019-05-01 09:30:01.118968 211.0000  730 210.80 211.15
    BAC 2019-05-01 09:30:01.186416  30.2450  380  30.24  30.27
    CVX 2019-05-01 09:30:01.639577 118.2550 2880 118.26 118.37
    ...                        ...      ...  ...    ...    ...
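The idea behind an "as of" join can be sketched in Python with a binary search per trade. This is an illustration under stated assumptions, not Empirical's implementation: `asof_join` is a hypothetical helper, timestamps are simplified to seconds after 09:30:00, and a quote with exactly the same timestamp as the trade is assumed to count as a match.

```python
from bisect import bisect_right
from collections import defaultdict

def asof_join(left, right, key, on):
    """For each left row, attach the latest right row with the same key
    whose `on` value is at or before the left row's `on` value."""
    # Index the right-hand rows by key, ordered by time.
    by_key = defaultdict(list)
    for row in sorted(right, key=lambda r: r[on]):
        by_key[row[key]].append(row)
    times = {k: [r[on] for r in rows] for k, rows in by_key.items()}
    extra = sorted({c for r in right for c in r} - {key, on})

    joined = []
    for row in left:
        k = row[key]
        # Position of the last right-hand time <= this row's time, or -1.
        idx = bisect_right(times.get(k, []), row[on]) - 1
        match = by_key[k][idx] if idx >= 0 else {}
        merged = dict(row)
        for col in extra:
            merged[col] = match.get(col)  # None when nothing matched
        joined.append(merged)
    return joined

# Rows taken from the tables above (timestamps as seconds after 09:30:00).
trades = [
    {"symbol": "AAPL", "timestamp": 0.578802, "price": 210.52},
    {"symbol": "BAC", "timestamp": 0.629205, "price": 30.25},
    {"symbol": "CVX", "timestamp": 0.944122, "price": 117.80},
    {"symbol": "AAPL", "timestamp": 1.002405, "price": 211.13},
]
quotes = [
    {"symbol": "AAPL", "timestamp": 0.410166, "bid": 210.45, "ask": 211.02},
    {"symbol": "AAPL", "timestamp": 0.491133, "bid": 210.80, "ask": 211.15},
    {"symbol": "CVX", "timestamp": 0.544263, "bid": 117.76, "ask": 118.34},
    {"symbol": "BAC", "timestamp": 0.585684, "bid": 30.24, "ask": 30.27},
    {"symbol": "CVX", "timestamp": 1.096591, "bid": 118.22, "ask": 118.54},
    {"symbol": "AAPL", "timestamp": 1.131702, "bid": 210.925, "ask": 211.20},
]
joined = asof_join(trades, quotes, key="symbol", on="timestamp")
```

The last AAPL trade still picks up the 210.80 / 211.15 quote because the next AAPL quote arrives after the trade.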

The time-series joins can go in different directions. Here are some made-up events that occurred throughout the trading day:

>>> let events = load("events.csv")

>>> events
 symbol           timestamp code
    CVX 2019-05-01 09:30:03   a1
    BAC 2019-05-01 09:30:04   e3
   AAPL 2019-05-01 09:30:06   f7
    CVX 2019-05-01 09:30:07   h9

I want to know the closest event for each trade. I'm also going to limit the search to within three seconds (3s).

>>> join trades, events on symbol asof timestamp nearest within 3s
 symbol                  timestamp    price size code
   AAPL 2019-05-01 09:30:00.578802 210.5200  780     
   AAPL 2019-05-01 09:30:00.580485 210.8100  390     
    BAC 2019-05-01 09:30:00.629205  30.2500  510     
    CVX 2019-05-01 09:30:00.944122 117.8000 5860   a1
   AAPL 2019-05-01 09:30:01.002405 211.1300  320     
   AAPL 2019-05-01 09:30:01.066917 211.1186  310     
   AAPL 2019-05-01 09:30:01.118968 211.0000  730     
    BAC 2019-05-01 09:30:01.186416  30.2450  380   e3
    CVX 2019-05-01 09:30:01.639577 118.2550 2880   a1
    ...                        ...      ...  ...  ...

The result has a lot of blanks (null values) for trades with no event close enough to line up. Empirical handles missing data automatically, so the nulls carry forward into any later computation on these results.
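A rough Python sketch of the "nearest within" behavior is below. It is a plain linear scan rather than anything Empirical is known to do, `nearest_join` is a hypothetical name, timestamps are seconds after 09:30:00, and treating the three-second limit as inclusive is an assumption. Unmatched rows get `None`, which plays the role of the blanks above.

```python
def nearest_join(left, right, key, on, within):
    """Attach to each left row the right row with the same key whose `on`
    value is closest in either direction, if it lies within `within`."""
    extra = sorted({c for r in right for c in r} - {key, on})
    joined = []
    for row in left:
        candidates = [
            r for r in right
            if r[key] == row[key] and abs(r[on] - row[on]) <= within
        ]
        best = min(candidates, key=lambda r: abs(r[on] - row[on]), default=None)
        merged = dict(row)
        for col in extra:
            merged[col] = best[col] if best is not None else None
        joined.append(merged)
    return joined

# Rows taken from the tables above (timestamps as seconds after 09:30:00).
trades = [
    {"symbol": "AAPL", "timestamp": 0.578802},
    {"symbol": "BAC", "timestamp": 0.629205},
    {"symbol": "CVX", "timestamp": 0.944122},
    {"symbol": "BAC", "timestamp": 1.186416},
]
events = [
    {"symbol": "CVX", "timestamp": 3.0, "code": "a1"},
    {"symbol": "BAC", "timestamp": 4.0, "code": "e3"},
    {"symbol": "AAPL", "timestamp": 6.0, "code": "f7"},
    {"symbol": "CVX", "timestamp": 7.0, "code": "h9"},
]
joined = nearest_join(trades, events, key="symbol", on="timestamp", within=3.0)
codes = [row["code"] for row in joined]
```

The first BAC trade is about 3.37 seconds from its only event, so it stays blank, while the second BAC trade is about 2.81 seconds away and picks up `e3`.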

Static typing

What makes Empirical unique is that it is statically typed. The compiler knows, before running any user code, whether that code is valid.

>>> 1 + 2
3

>>> 'a' + 'b'
Error: unable to match overloaded function +
  candidate: (Int64, Int64) -> Int64
    argument type at position 0 does not match: Char vs Int64
  candidate: (Float64, Float64) -> Float64
    argument type at position 0 does not match: Char vs Float64
  candidate: (Int64, Float64) -> Float64
    argument type at position 0 does not match: Char vs Int64
  ...
  <53 others>

This is extremely useful for catching typos. Suppose I want to sort my quotes by the bid-ask spread:

>>> sort quotes by (asks - bid) / bid
Error: symbol asks was not found

Here it caught the misspelled asks. I can correct it to ask:

>>> sort quotes by (ask - bid) / bid
 symbol                  timestamp      bid      ask
    BAC 2019-05-01 09:32:46.313487  30.5650  30.5650
    BAC 2019-05-01 09:32:53.738446  30.6124  30.6124
    BAC 2019-05-01 09:39:24.459415  31.0600  31.0600
   AAPL 2019-05-01 09:45:51.931597 206.9400 206.9500
   AAPL 2019-05-01 09:43:59.903292 206.3200 206.3300
    BAC 2019-05-01 09:32:50.369746  30.6400  30.6417
    CVX 2019-05-01 09:32:57.242072 119.7732 119.7800
   AAPL 2019-05-01 09:38:18.980026 205.1100 205.1222
   AAPL 2019-05-01 09:38:19.978890 205.1100 205.1251
    ...                        ...      ...      ...

All of this error checking is performed before running the user's code. This is beneficial for writing large scripts.
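To give a feel for checking before running, here is a small Python analogue that inspects an expression's syntax tree and reports unknown column names without evaluating anything. It covers only name lookup, whereas Empirical's compiler also checks types; `unknown_names` is a hypothetical helper, and nothing here reflects how Empirical's compiler is actually written.

```python
import ast

def unknown_names(expr, columns):
    """Return the names used in expr that are not known columns.
    The expression is parsed but never evaluated."""
    tree = ast.parse(expr, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(used - set(columns))

quote_columns = ["symbol", "timestamp", "bid", "ask"]

bad = unknown_names("(asks - bid) / bid", quote_columns)
good = unknown_names("(ask - bid) / bid", quote_columns)
```

Here `bad` contains the misspelled `asks` and `good` is empty, so a script could refuse to start at all if any expression fails the check.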

I used to run overnight simulations back during my quantitative finance days, only to find out in the morning that the program had crashed after a few hours because of a typo later in the script. I made Empirical specifically to prevent this from ever happening again.

How it works

The real "magic" here is that Empirical can infer a Dataframe's type from an external source at compile time. Let's look back at how the table is loaded:

let trades = load("trades.csv")

The path to the file is known at compile time. Empirical samples the file to figure out what's in it. In fact, any value that can be solved at compile time is acceptable.

let filename = "trades.csv"
let trades = load("./" + filename)

The load() function is actually a macro that invokes a templated function on a templated type:

csv_load{CsvProvider{"trades.csv"}}("trades.csv")

The CsvProvider invokes an internal function that determines the types:

>>> _csv_infer("trades.csv")
"{symbol: String, timestamp: Timestamp, price: Float64, size: Int64}"

And it is this inferred type that is automatically compiled into the user's code. The entire process maintains static typing even though the user didn't explicitly list the types!
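The sampling step can be imagined along these lines in Python. This is a guess at the general technique, not the logic behind `_csv_infer`: the function names, the 100-row sample size, and the order in which types are tried are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

def infer_value_type(text):
    """Name the narrowest type that can parse a single CSV field."""
    for name, parse in (
        ("Int64", int),
        ("Float64", float),
        ("Timestamp", datetime.fromisoformat),
    ):
        try:
            parse(text)
            return name
        except ValueError:
            continue
    return "String"

def unify(types):
    """Combine the per-value types seen in one column into a column type."""
    if types == {"Int64"}:
        return "Int64"
    if types and types <= {"Int64", "Float64"}:
        return "Float64"
    if types == {"Timestamp"}:
        return "Timestamp"
    return "String"

def infer_schema(csv_text, sample_rows=100):
    """Infer a column-name-to-type mapping from the first rows of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, text in row.items():
            seen[name].add(infer_value_type(text))
    return {name: unify(types) for name, types in seen.items()}

sample = """symbol,timestamp,price,size
AAPL,2019-05-01 09:30:00.578802,210.5200,780
BAC,2019-05-01 09:30:00.629205,30.2500,510
"""
schema = infer_schema(sample)
```

The difference in Empirical is when this happens: the inference runs while the code is being compiled, so the resulting type is available to the type checker.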

So what happens when the file path cannot be determined at compile time? This occurs when using an external variable, like argv when running a script from the command line. (As with many programming languages, argv is the list of the user's command-line arguments.)

For example, running this in a script:

let trades = load(argv[1])

gives the following error:

Error: macro parameter filename requires a comptime literal
Error: unable to determine type for trades

So now we must specify the type and invoke the templated function directly.

data Trade:
  symbol: String,
  timestamp: Timestamp,
  price: Float64,
  size: Int64
end

let trades = csv_load{Trade}(argv[1])

I can run my overnight simulation with the confidence that my script doesn't have common typos.
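A loose Python analogue of loading with a declared type is sketched below. It differs from Empirical in important ways: it builds a list of row objects rather than a columnar Dataframe, and a mismatch only surfaces when the file is read. `PARSERS` and this `csv_load` are hypothetical and merely mirror the names in the Empirical snippet.

```python
import csv
import io
from dataclasses import dataclass, fields
from datetime import datetime

@dataclass
class Trade:
    symbol: str
    timestamp: datetime
    price: float
    size: int

# How to parse a CSV field for each declared field type.
PARSERS = {str: str, datetime: datetime.fromisoformat, float: float, int: int}

def csv_load(row_type, stream):
    """Read CSV rows from a text stream into instances of row_type,
    converting each field according to the declared type."""
    reader = csv.DictReader(stream)
    return [
        row_type(**{f.name: PARSERS[f.type](row[f.name]) for f in fields(row_type)})
        for row in reader
    ]

sample = io.StringIO(
    "symbol,timestamp,price,size\n"
    "AAPL,2019-05-01 09:30:00.578802,210.5200,780\n"
)
trades = csv_load(Trade, sample)
```

Because the column names and types are written down up front, any later code that refers to a field not declared in `Trade` can be flagged without opening the file.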

Under the hood

Fuller details can be found in a pair of blog posts (1, 2), but broadly speaking Empirical compiles to a virtual machine, which dispatches to a runtime.

Launching Empirical on the command line with --dump-vvm will show what the Vector Virtual Machine (VVM) is doing. VVM has its own assembly language that you can code in (it's how I do regression tests).

Empirical's

load("trades.csv")

becomes VVM's

$0 = {"symbol": Sv, "timestamp": Tv, "price": f64v, "size": i64v}

@1 = "trades.csv"

load @1 $0 %0

This may look funky, but it simply defines the type ($0), sets a global register for the string that represents the filename (@1), and then invokes the load opcode. The result is saved to the local/temporary register %0.

The VWAP example

func wavg(ws, vs) = sum(ws * vs) / sum(ws)

from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)

becomes a ton of VVM:

$1 = {"symbol": Sv, "timestamp": Tv}
$2 = {"symbol": Sv, "timestamp": Tv, "vwap": f64v}

@3 = def wavg([Int64], [Float64])("ws": i64v, "vs": f64v) f64s:
  mul_i64v_f64v %0 %1 %2
  sum_f64v %2 %3
  sum_i64v %0 %4
  div_f64s_i64s %3 %4 %5
  ret %5
end

alloc $1 %1
member @1 0 %2
member %1 0 %3
assign %2 Sv %3
member @1 1 %4
unit_m_i64s 5 %5
bar_Tv_Ds %4 %5 %6
member %1 1 %7
assign %6 Tv %7
group $0 @1 $1 %1 $2 %8 %9 %10
assign 0 i64s %11
lt_i64s_i64s %11 %10 %12
bfalse %12 86
member %9 %11 %13
member %13 3 %14
member %13 2 %15
call @3 3 %14 %15 %16
member %8 2 %17
append %16 f64s %17
add_i64s_i64s %11 1 %11
br 47
repr %8 $2 %18
save %18

This monster groups the Dataframe according to the user's criteria (the by clause) and then repeatedly invokes the wavg function on the individual groups.

VVM executes everything in a runtime. Each procedure is vector-aware by default, meaning that the entire column of the Dataframe is handled by one function call in the runtime. I.e., the virtual machine performs just one dispatch for each column routine, which amortizes the cost of using a VM.
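The amortization argument can be illustrated with a toy register machine in Python. The opcode names are simplified stand-ins (not real VVM opcodes) and the interpreter loop is an assumption about the general shape of such a VM. The program mirrors the four instructions in the body of `wavg` above.

```python
def mul_v(a, b):
    """Element-wise product of two columns."""
    return [x * y for x, y in zip(a, b)]

def sum_v(a):
    """Sum of one column, producing a scalar."""
    return sum(a)

def div_s(a, b):
    """Scalar division."""
    return a / b

OPCODES = {"mul_v": mul_v, "sum_v": sum_v, "div_s": div_s}

def run(program, registers):
    """Execute each instruction once; return the number of dispatches."""
    dispatches = 0
    for opcode, *args, dest in program:
        registers[dest] = OPCODES[opcode](*(registers[a] for a in args))
        dispatches += 1
    return dispatches

# Same shape as the wavg body: multiply, two sums, one divide.
WAVG = [
    ("mul_v", 0, 1, 2),
    ("sum_v", 2, 3),
    ("sum_v", 0, 4),
    ("div_s", 3, 4, 5),
]

registers = {0: [780, 390], 1: [210.52, 210.81]}  # sizes, prices
dispatches = run(WAVG, registers)
vwap = registers[5]
```

The dispatch count stays at four whether the columns hold two rows or two million, so the per-instruction overhead shrinks relative to the work done inside each routine. In this sketch those routines are slow Python loops; in Empirical they are C++.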

Empirical's entire stack is written in C++. Because of the amortized cost, Empirical is about as fast as hand-written C++ if the Dataframes are large enough.

Where to get it

You can run Empirical on Repl.it right now, or download it for your own machine. Be sure to read the tutorial.

The source code is available under the AGPL with the Commons Clause.


Info for the Jam

This submission is on behalf of the Empirical Software team. I am the creator of Empirical and Andrew is my beta tester.

The timing of the Jam straddled my previous sprint (metaprogramming (1, 2)) and my next sprint (streaming computation).

The change log and more granular commit history should serve as proof of work for the Jam period.

Comments
HahaYes (1249)

DANG beats Cookeylang any day
maybe our lang :(

TheDrone7 (1442)

Moving to share since this is just an existing project being ported to repl.it. Even though it has been worked on in the jam duration, it doesn't satisfy the requirements as specified in the blog post.

chrisaycock (24)

@TheDrone7 I went by what was in the blog's FAQ:

Can I remix or improve on an existing language?

Yes, as long as you're adding original ideas and putting an effort to meaningfully change or improve the language.

TheDrone7 (1442)

@chrisaycock we know that but the changes made within the jam duration didn't seem to impact the language enough.

chrisaycock (24)

@TheDrone7 As I stated in the linked change log and commit history, my work during the Jam period added metaprogramming components that allowed for some of the features highlighted in my post here. For example:

  • extensions to generic functions allowed for wavg() to omit explicit types
  • inlining and macros allowed for the seamless load() (it used to be a hardcoded function up until a couple weeks ago)
  • global variables didn't even exist

It's your call to disqualify, but I find it very strange when you claim I didn't make changes to the language over the last three weeks.

TheDrone7 (1442)

@chrisaycock I never said you didn't make changes, but the majority of the submissions are entirely new languages built within the duration; we expect you to make changes that impact the language that much. Compared to entirely new languages being built, your changes weren't enough.

alfredbirk (1)

@TheDrone7 I'm not a participant or anything, but as an outsider of this contest, I must say I find it weird how these submissions are disqualified. This project has literally been worked on every day during the contest period.

TheDrone7 (1442)

@alfredbirk these submissions are great projects that did require a lot of effort, yes. They are only being disqualified for not being what we were looking for, we set up some requirements and these don't meet them is all. Being disqualified from the jam doesn't mean you didn't make something good, it just means you worked on something we weren't aiming for in the jam.

alfredbirk (1)

@TheDrone7 Hmm, I would say those changes are an effort to meaningfully improve the language, but your call..

alfredbirk (1)

@TheDrone7 Possibly not very impactful changes, but definitely an effort to meaningfully improve the language. But I guess it's okay to change the rules?

TheDrone7 (1442)

@alfredbirk I cannot change the rules. The disqualification is due to lack of impact but it will still be showcased as a submission. And possibly, as an honorable mention because I do like the idea behind this.

chrisaycock (24)

@TheDrone7 I'm not asking you to change the rules. I'm asking you to judge me by the merits of what I did during the Jam. Generics are such a difficult feature in statically typed languages that Go and Zig don't have them.

The whole promise of this contest was that I would be judged by language experts.

TheDrone7 (1442)

@chrisaycock you would normally be judged by the experts but I'm just here to make sure the submissions being forwarded to them fulfil the requirements for the jam.
Moreover, I realised this when I went over all your GitHub commits, even if I were to allow this submission, you would've still failed to satisfy the requirement of working as a team of at least 2 members through the jam. All of the commits were made by you. And you're also the only member of the team on repl.it
So I'm afraid, despite all your hard efforts, this would still not be a valid submission.

I do appreciate all that hard work, trust me. But rules are rules and I'm only here to enforce them.

chrisaycock (24)

@TheDrone7 The rules don't say that commits have to be done by two people. It literally does not say that.

My teammate is @AndrewCarr2, who is on Repl.it. He tests my work using the binaries. He used to file issues on GitHub (e.g., 1, 2); now he messages me directly and I just CC him.

I hate to belabor this point since you've made up your mind, but you clearly are not "enforcing the rules" when you are making this up.

TheDrone7 (1442)

@chrisaycock you did not mention your teammate in the post as the submission guidelines suggested. You're clear there then.

As for the language though, I still cannot allow it as it was a mixed decision of 3 people not just mine alone.

chrisaycock (24)

@TheDrone7 I did mention my teammate in my post. Let me quote it for you:

This submission is on behalf of the Empirical Software team. I am the creator of Empirical and Andrew is my beta tester.

TheDrone7 (1442)

@chrisaycock mentioning refers to using the at symbol @ followed by the username, such as - @TheDrone7 it gives us a link to the user profile of your teammates which we can use to ensure that it's not your alternate account since we've had people do that in older jams we organised like this one.

AndrewCarr2 (2)

I never thought my existence would be questioned. I am definitely real and happy to prove it. @isaiah08

https://andrewnc.github.io

chrisaycock (24)

@AndrewCarr2 Yeah, this whole experience has been a disappointment.

TheDrone7 (1442)

@chrisaycock I would like to apologise on their behalf. I've issued a warning against their behaviour. Also a simple note, try to not take any struck-through comments seriously. It usually means they're saying it as a joke and don't mean it. But as we can both see it can offend others sometimes so I have decided to take action in this case.

TheDrone7 (1442)

@HahaYes I wouldn't recommend relating it to repl.it in any way.

isaiah08 (51)

I am extremely sorry, I didn't mean to offend anyone. @AndrewCarr2 @chrisaycock

AndrewCarr2 (2)

@isaiah08 I know you didn't 👍🏻 No problem here. Feelings are pretty high in this thread, but no harm done.

HahaYes (1249)

yo you are into stocks too? Same!

chrisaycock (24)

@HahaYes After my PhD, I spent a decade working for hedge funds and proprietary trading firms. I specialized in statistical arbitrage and high-frequency trading.

HahaYes (1249)

@chrisaycock whoa. I'm a teen that invests. (Accidentally did an options trade so glad I didn't lose money XD)

fuzzyastrocat (1289)

@HahaYes I wAtcH thE StoCKs seGmEnt oN thE nEWs!

HahaYes (1249)

ok this is too good

chrisaycock (24)

@HahaYes If you want some more information, I announced the first beta on Hacker News here. I put the project on hold for over a year while I dealt with other things. I picked it up again recently to implement some things I've been obsessively thinking about.

HahaYes (1249)

@chrisaycock lol nice. Teens vs Adults. What a showdown

hg0428 (171)

This language is almost as good as ours, I am going to need to hurry up on the object-oriented stuff.