Parsing historical MBTA data

Transit and Data

Transit systems and public data are a great match. In my daily life, I interact with so many devices and applications that pull transit data: the LED matrix in my living room, the app on my phone, the countdown clocks in the station itself. The data is incredibly accessible, too, as the myriad of DIY projects built on it demonstrates.

There are, in general, two types of data about a transit system: what the system is in theory (schedules, routes, and stations determined months in advance), and what the system is in practice (vehicle locations, arrival predictions, and dropped/added trips updated in realtime).

Transit Systems in Theory

Describing what a transit system does in theory is the easy part. The General Transit Feed Specification (GTFS) defines a common format made up (in true 2006 fashion) of a set of TXT files contained within a ZIP file, each describing a different aspect of the system. For instance, stops.txt describes the vehicle stops and stations in the system, stop_times.txt describes when each stop is serviced by a vehicle trip, and trips.txt can specify the train number, whether bikes are allowed, and more.
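
As a sketch of how lightweight this format is, here's how you might pull stops.txt out of a GTFS ZIP using nothing but the Python standard library (the file and column names come from the GTFS spec; the path is a placeholder for wherever you've downloaded a feed):

```python
import csv
import io
import zipfile

def read_gtfs_table(path: str, name: str) -> list[dict]:
    """Read one TXT table (e.g. stops.txt) out of a GTFS ZIP as a list of dicts."""
    with zipfile.ZipFile(path) as z:
        with z.open(name) as f:
            # GTFS files are UTF-8 CSVs, occasionally with a byte-order mark.
            text = io.TextIOWrapper(f, encoding="utf-8-sig")
            return list(csv.DictReader(text))

# stops = read_gtfs_table("MBTA_GTFS.zip", "stops.txt")
# print(stops[0]["stop_id"], stops[0]["stop_name"])
```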

The benefit of this approach is clear: sharing a common standard means that code written for one city can work seamlessly with others. Smaller transit operators can create these files by hand, and larger ones can build up the necessary automation to handle hundreds of routes and thousands of stops. Since these ZIP files usually only change when new schedules are determined, distributing them is straightforward and storing historical ones is easy to do.

Transit Systems in Practice

Realtime data is where things get messy. The requirements are more demanding. Data usually needs to be generated and distributed without a human in the loop, and clients need to pull updates every minute or faster.

GTFS does offer a solution to this in the form of GTFS-Realtime. However, as a Google initiative, it uses Protobuf as its interchange format, which requires language-specific bindings to work with. My home system, the MBTA, offers a JSON version of its feeds as well.

Even so, I tend to use the MBTA's excellent, well-documented service API, which makes it easy to ingest only the data relevant to the lines and stations I need. Interoperability with other systems is usually not a priority for my projects.
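
For illustration, a query against that API might look like the following. The /predictions endpoint and the filter[stop]/filter[route] parameters are from the public V3 API documentation; the actual fetch needs network access, so it's left commented out:

```python
import urllib.parse

API_BASE = "https://api-v3.mbta.com"

def predictions_url(stop: str, route: str) -> str:
    """Build a V3 API query for upcoming predictions at one stop on one route."""
    query = urllib.parse.urlencode({"filter[stop]": stop, "filter[route]": route})
    return f"{API_BASE}/predictions?{query}"

# Fetching the feed (requires network access):
#   import json, urllib.request
#   with urllib.request.urlopen(predictions_url("place-sull", "Orange")) as resp:
#       for pred in json.load(resp)["data"]:
#           print(pred["attributes"]["arrival_time"])
```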

GTFS-Realtime, however, does not specify how historical data should be represented and stored, leaving transit systems to invent bespoke formats for this data -- that is, if they choose to make it available at all.

Historical Data at the MBTA

LAMP

The MBTA has a service called LAMP (Lightweight Application for Measuring Performance), which does three things:

  1. Publish historical GTFS schedule data, showing the state of the system "in theory" for any arbitrary date since 2009.
  2. Publish historical subway performance data (the system "in practice") sorted by service date.
  3. Publish miscellaneous datasets for MBTA-internal uses (while these are public, they are entirely undocumented).

That second point is what we'll focus on for parsing historical realtime data.

Open Data Portal

The MBTA Open Data Portal contains lots of additional reports generated by the MBTA covering ridership, prediction accuracy, and more across the various transit modes. One such dataset is the Bus Arrival Departure Times 2025 dataset, which nicely complements the subway data published by the LAMP team.

Let's get parsing

Subway data

Data format

The subway data we're interested in is distributed in the Parquet format, using URLs like this for each day of service:

https://performancedata.mbta.com/lamp/subway-on-time-performance-v1/YYYY-MM-DD-subway-on-time-performance-v1.parquet
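
Since only the date varies, a small helper (my own, using the URL pattern above) can generate the link for any service day:

```python
import datetime

BASE = "https://performancedata.mbta.com/lamp/subway-on-time-performance-v1"

def parquet_url(day: datetime.date) -> str:
    """URL of the LAMP subway performance file for one service date."""
    return f"{BASE}/{day:%Y-%m-%d}-subway-on-time-performance-v1.parquet"

print(parquet_url(datetime.date(2025, 10, 31)))
# https://performancedata.mbta.com/lamp/subway-on-time-performance-v1/2025-10-31-subway-on-time-performance-v1.parquet
```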

The Parquet file format is an Apache project specification for storing tabular data. It packs column data efficiently but, unlike Protobuf, doesn't require the reader to know the schema definition in advance, which means we can easily throw these files into Pandas, the popular Python data analysis library.

import pandas as pd

path = "https://performancedata.mbta.com/lamp/subway-on-time-performance-v1/2025-10-31-subway-on-time-performance-v1.parquet"
df = pd.read_parquet(path)
with open("data.json", "w") as out:
    out.write(df.to_json(orient="records"))

We can see that the data has 27 columns. While these aren't documented anywhere, here's how I assume the data is structured:

  • Each entry describes a vehicle arriving and/or departing a station as part of a revenue trip.
  • stop_id and parent_station describe which platform and station the vehicle was at: stop_id identifies the platform (for instance, 70513 means the northbound platform at East Somerville), and parent_station describes the station it belongs to (in this case, place-esomr).
  • move_timestamp seems to be when the train started moving towards the given station, and stop_timestamp is when it reached the station.
  • travel_time_seconds seems to be the amount of time it took the train to reach the given station, and dwell_time_seconds is how long it spent there.
  • service_date describes the service date as an integer with a decimal expansion of the form YYYYMMDD... bruh
  • route_id defines the specific route, such as Blue, Green-E, or Red. branch_route_id defines the branch, such as Blue (no branching), Green-E, or Red-A. I am unsure why the Red line branching is treated differently than the Green line here.
  • direction_id is either true or false, depending on which way the train is heading. direction is the human-readable name, like South. direction_destination is the direction given as a destination station or station pair, like Ashmont/Braintree or Boston College.
  • start_time seems to be the time that the vehicle started moving, given as "seconds since midnight at the start of the service day". Since the MBTA defines "service days" as starting and ending around 3 AM, the first vehicles have a start_time of around 17680 (4:54 AM) and the last ones have a start_time of around 95976 (2:39 AM). stop_count is the number of stops that vehicle has made since that time, I guess?
  • vehicle_id is a unique identifier for the vehicle, vehicle_label is a human-readable label (usually the number of the first one or two cars), and vehicle_consist is the car numbers of each car in the train.
  • trip_id identifies the trip that the vehicle was on.
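
Based on my reading of service_date and start_time above (which, again, is guesswork since the columns are undocumented), combining them into a wall-clock time might look like:

```python
import datetime

def service_datetime(service_date: int, start_time: int) -> datetime.datetime:
    """Combine a YYYYMMDD service_date integer with a seconds-past-midnight
    start_time. start_time can exceed 86400 for post-midnight trips that
    still belong to the previous service day."""
    day = datetime.datetime.strptime(str(service_date), "%Y%m%d")
    return day + datetime.timedelta(seconds=start_time)

print(service_datetime(20251031, 17680))  # 2025-10-31 04:54:40
print(service_datetime(20251031, 95976))  # 2025-11-01 02:39:36
```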

Simulating trips

Suppose I had been on the southbound platform at Sullivan Station at 5:15 PM. When would I have made it to North Station?

Let's start by finding all the trips which arrived at North Station coming southbound.

trips_to_target = df[(df["parent_station"] == "place-north") & (df["direction"] == "South")]
Stop ID | Parent Station | Move Timestamp | Stop Timestamp | Route ID | Direction | Trip ID
70026   | place-north    | 1761902928.0   | 1761903013.0   | Orange   | South     | 70525780
70026   | place-north    | 1761903114.0   | 1761903199.0   | Orange   | South     | 70525786
70026   | place-north    | 1761903280.0   | 1761903367.0   | Orange   | South     | 70525792
70026   | place-north    | 1761903673.0   | 1761903759.0   | Orange   | South     | 70525798

Now, let's find the earliest one of those trips which also stopped at Sullivan.

trip_ids = trips_to_target["trip_id"].unique()
trips_from_start = df[
    (df["parent_station"] == "place-sull")
    & (df["direction"] == "South")
    & (df["trip_id"].isin(trip_ids))
]
Stop ID | Parent Station | Move Timestamp | Stop Timestamp | Route ID | Direction | Trip ID
70030   | place-sull     | 1761902684.0   | 1761902737.0   | Orange   | South     | 70525780
70030   | place-sull     | 1761902876.0   | 1761902929.0   | Orange   | South     | 70525786
70030   | place-sull     | 1761903061.0   | 1761903110.0   | Orange   | South     | 70525792
70030   | place-sull     | 1761903454.0   | 1761903502.0   | Orange   | South     | 70525798

Most of these are the same trip IDs as before, but now the timestamps line up with the train's stop at Sullivan. If we were dealing with different branches of the Green Line, for instance, this step would also filter out trips which don't run between our station pair.

Finally, let's find the first trip from the start which departed after we got to the station.

Times in this data are represented as Unix timestamps (i.e., seconds since 1970), which we'll interpret in the local timezone (U.S. Eastern time). So, for instance, the first trip in our list arrived at the station at 1761902737 seconds, or 5:25:37 AM on 2025-10-31.
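
We can double-check that conversion with the standard library:

```python
import datetime
import zoneinfo

# Render the first trip's stop_timestamp in U.S. Eastern time.
eastern = zoneinfo.ZoneInfo("America/New_York")
ts = datetime.datetime.fromtimestamp(1761902737, tz=eastern)
print(ts)  # 2025-10-31 05:25:37-04:00
```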

import datetime
timestamp = datetime.datetime.fromisoformat("2025-10-31 17:15:00").timestamp()
trips_after_time = trips_from_start[trips_from_start["stop_timestamp"] > timestamp]
Stop ID | Parent Station | Move Timestamp | Stop Timestamp | Route ID | Direction | Trip ID
70030   | place-sull     | 1761946007.0   | 1761946055.0   | Orange   | South     | 70526030
70030   | place-sull     | 1761946114.0   | 1761946163.0   | Orange   | South     | 70526038
70030   | place-sull     | 1761946438.0   | 1761946487.0   | Orange   | South     | 70526046
70030   | place-sull     | 1761946646.0   | 1761946693.0   | Orange   | South     | 70526054

The first of those trips is trip ID 70526030, which arrived at Sullivan at 5:27:35 PM.

next_train = trips_after_time.loc[trips_after_time["stop_timestamp"].idxmin()]
train_arrival = df[
    (df["trip_id"] == next_train["trip_id"])
    & (df["parent_station"] == "place-north")
]

Looking up its arrival into North Station, we see:

Stop ID | Parent Station | Move Timestamp | Stop Timestamp | Route ID | Direction | Trip ID
70026   | place-north    | 1761946237.0   | 1761946322.0   | Orange   | South     | 70526030

It appears we would have arrived at 5:32:02 PM.
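
The three filtering steps above can be folded into a single helper. This is my own sketch (the simulate_leg name and the lack of error handling are mine), using the same column names as the data:

```python
def simulate_leg(df, origin, dest, direction, depart_ts):
    """Find the first trip leaving `origin` (a parent_station ID) after
    `depart_ts` that also serves `dest`, in a pandas DataFrame of LAMP
    records. Returns (boarding_time, arrival_time) as Unix seconds."""
    at_dest = df[(df["parent_station"] == dest) & (df["direction"] == direction)]
    at_origin = df[
        (df["parent_station"] == origin)
        & (df["direction"] == direction)
        & (df["trip_id"].isin(at_dest["trip_id"].unique()))
        & (df["stop_timestamp"] > depart_ts)
    ]
    # Earliest qualifying departure, then its arrival on the same trip.
    board = at_origin.loc[at_origin["stop_timestamp"].idxmin()]
    arrival = at_dest[at_dest["trip_id"] == board["trip_id"]]
    return board["stop_timestamp"], arrival["stop_timestamp"].iloc[0]
```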

Bus data

The bus data is a little different. The MBTA provides it as a ZIP file for each year, with a CSV file for each month. At the time of writing, the latest one is MBTA-Bus-Arrival-Departure-Times_2025-09.csv.

The data is pretty straightforward, providing the following columns:

  • service_date is self-explanatory.
  • route_id is the bus number. Note that for some buses, the internal route ID does not match the consumer-facing bus number. For instance, route 89/93 is identified as 194 internally, and the first 39 bus of the day is considered route 192 internally.
  • direction_id is the direction identifier, which appears to only ever be Outbound or Inbound.
  • half_trip_id is like the trip_id of the subway data, but since the outbound and inbound trip of a bus route often share a trip ID, this disambiguates them.
  • stop_id is the identifier of the platform at which the bus stops. Large stations like Sullivan provide many bus platforms, each with its own ID.
  • point_type specifies if this stop is a Startpoint, Midpoint, or Endpoint of the route.
  • scheduled and actual give the scheduled and actual arrival times for the bus, respectively. However, for some reason, the date format is of the form 1900-01-01T14:28:00Z... We'll dig into this more later.

Notably, not every stop is covered in this data, only the time points (places along the route with a scheduled arrival time). However, the time points are typically placed at high-traffic stops like subway stations or the start and end of the line, so they are probably useful anyway.

Basic parsing

Let's see if we can repeat a similar "simulated journey" to what we did with the subway, but with the bus data. Suppose I'm trying to get from Harvard to Hynes Convention Center station using the 1 bus on 2025-09-20 at 11:00 AM.

Ingesting the CSV file is quite easy:

import pandas as pd
path = "MBTA_Bus_Arrival_Departure_Times_2025/MBTA-Bus-Arrival-Departure-Times_2025-09.csv"
df = pd.read_csv(path)

The first part is quite similar to simulating a subway trip. Let's try selecting all of the trips that arrived at our destination stop ID, stop 79, then find the records for those trips departing from Harvard, stop 110.

trips_to_target = df[
    (df["stop_id"] == 79)
    & (df["direction_id"] == "Inbound")
]
trip_ids = trips_to_target["half_trip_id"].unique()
trips_from_start = df[
    (df["stop_id"] == 110)
    & (df["direction_id"] == "Inbound")
    & (df["half_trip_id"].isin(trip_ids))
]
Service Date | Route ID | Direction ID | Half Trip ID | Stop ID | Scheduled            | Actual
2025-09-01   | 01       | Inbound      | 68099570     | 110     | 1900-01-01T11:31:00Z | 1900-01-01T11:44:24Z
2025-09-01   | 01       | Inbound      | 68099572     | 110     | 1900-01-01T12:40:00Z | 1900-01-01T13:01:02Z
2025-09-01   | 01       | Inbound      | 68099573     | 110     | 1900-01-01T22:06:00Z | 1900-01-01T22:54:09Z
2025-09-01   | 01       | Inbound      | 68099575     | 110     | 1900-01-01T23:42:00Z | 1900-01-02T00:18:43Z

And finally, we need to filter by trips happening after our chosen start time, so we need to handle the problem of dates.

Dates and times nonsense

How do we assemble the service_date and scheduled/actual fields into an actual timestamp like we got with the subway data?

Let's start by parsing the service date.

timestamp = datetime.datetime.fromisoformat("2025-09-01")
>>> datetime.datetime(2025, 9, 1, 0, 0)

Now, let's parse the timestamp offset. Note that it is in UTC, not local time.

offset = datetime.datetime.fromisoformat("1900-01-01T11:31:00Z")
>>> datetime.datetime(1900, 1, 1, 11, 31, tzinfo=datetime.timezone.utc)

Right now, the offset is an aware datetime, meaning it contains timezone info. Combining aware datetimes with their counterparts, naive datetimes, is usually not allowed. Let's make the service date an aware datetime as well.

You might be tempted to use the astimezone method, but this will convert the naive datetime to an aware datetime assuming the naive datetime is in local time, which is not what we want -- we want to keep the year/month/day values the same, but just attach timezone information to this instance. We can use the replace method and replace the tzinfo field.

timestamp = timestamp.replace(tzinfo=datetime.UTC)
>>> datetime.datetime(2025, 9, 1, 0, 0, tzinfo=datetime.timezone.utc)

Cool. Now the only thing we need to do with the offset is subtract the placeholder date (1900-01-01). Python implements this nicely, where subtracting one datetime from another gives a timedelta object.

offset -= datetime.datetime(1900, 1, 1, tzinfo=datetime.UTC)
>>> datetime.timedelta(seconds=41460)

Now, we can add this to our service date:

timestamp += offset
>>> datetime.datetime(2025, 9, 1, 11, 31, tzinfo=datetime.timezone.utc)

And finally, convert it from UTC to our local time, then strip the timezone info to match the behavior of our other code. If this were production code, we would want to only use aware datetimes... but we're just messing around so let's do what's easy.

import zoneinfo
timestamp = timestamp.astimezone(zoneinfo.ZoneInfo("America/New_York"))
timestamp = timestamp.replace(tzinfo=None)
>>> datetime.datetime(2025, 9, 1, 7, 31)

Cool! So that trip was at 2025-09-01 at 7:31 AM.

Pulling it together

Okay, back to the show. We were trying to filter by trips happening after a given time. Let's add a column to our dataframe with a proper timestamp to match our other data.

import zoneinfo
import zoneinfo

def convert_timestamp(row):
    if not isinstance(row["actual"], str):
        return pd.NA
    timestamp = datetime.datetime.fromisoformat(row["service_date"])
    offset = datetime.datetime.fromisoformat(row["actual"])
    timestamp = timestamp.replace(tzinfo=datetime.UTC)
    offset -= datetime.datetime(1900, 1, 1, tzinfo=datetime.UTC)
    timestamp += offset
    timestamp = timestamp.astimezone(zoneinfo.ZoneInfo("America/New_York"))
    timestamp = timestamp.replace(tzinfo=None)
    return timestamp

df["timestamp"] = df.apply(convert_timestamp, axis=1)
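
A side note: apply with a Python-level function is slow on large frames. If performance mattered, the same conversion could be vectorized; this is a sketch under the same assumptions about the service_date and actual columns (the convert_timestamps name is mine):

```python
import pandas as pd

def convert_timestamps(df: pd.DataFrame) -> pd.Series:
    """Vectorized equivalent of convert_timestamp: service_date plus the
    time-of-day offset hidden in the 1900-01-01-based 'actual' column."""
    service = pd.to_datetime(df["service_date"]).dt.tz_localize("UTC")
    # Unparseable/missing 'actual' values become NaT, and stay NaT below.
    offset = pd.to_datetime(df["actual"], errors="coerce", utc=True) - pd.Timestamp(
        "1900-01-01", tz="UTC"
    )
    local = (service + offset).dt.tz_convert("America/New_York")
    return local.dt.tz_localize(None)
```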

After rebuilding trips_from_start with this new column, we can select the trips after our chosen time, choose the one that departed earliest, and see when it got to Hynes in a similar way to our subway trip simulator.

timestamp = datetime.datetime.fromisoformat("2025-09-20 11:00:00")
trips_after_time = trips_from_start[trips_from_start['timestamp'] > timestamp]
next_bus = trips_after_time.loc[trips_after_time['timestamp'].idxmin()]
bus_arrival = df[(df['half_trip_id'] == next_bus['half_trip_id']) & (df['stop_id'] == 79)]
Service Date | Route ID | Direction ID | Half Trip ID | Stop ID | Timestamp
2025-09-20   | 01       | Inbound      | 68311684     | 79      | 2025-09-20 11:24:05

I would've made it to Hynes at 11:24 AM. Not bad!

Simulating longer journeys

Why do we care about simulating historical journeys, you might ask? It's useful for answering a few different types of questions:

  • How early should I leave my house to make it to work on time 90% of the time?
  • How reliably, on average, can I make certain connections between modes?

And, my favorite:

  • If I wanted to visit every subway station in the MBTA as quickly as possible, what would be the fastest route?

Simulating an existing speedrun

Transit system speedrunning is a phenomenon in which competitors attempt to travel through all stations within a system as quickly as possible.

My friend Tris completed a speedrun about a year ago and documented her route in a Mastodon thread. Let's see if we can accurately simulate it!

I made a few improvements to the algorithm:

  • Caching the processed data where possible!
  • Instead of using stop_timestamp + dwell_time_seconds, I used the move_timestamp of the station immediately after the given one. This seemed much more accurate for terminus stations where dwell_time_seconds was set to null (despite those stations being where the train dwells for the longest amount of time).
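
That second improvement can be expressed as a groupby/shift over each trip's stops. This is roughly how I'd sketch it (the departure_times name is mine):

```python
import pandas as pd

def departure_times(df: pd.DataFrame) -> pd.Series:
    """Estimate when each train left each station: the move_timestamp of the
    *next* stop on the same trip. Unlike dwell_time_seconds, this also works
    at terminus stations, where dwell_time_seconds is null."""
    df = df.sort_values(["trip_id", "stop_timestamp"])
    # shift(-1) pulls the following row's value within each trip;
    # the last stop of a trip gets NaN.
    return df.groupby("trip_id")["move_timestamp"].shift(-1)
```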

For parts of the run involving walking, I set very optimistic transfer times because I can vouch for the fact that Tris walks very fast.

The code for this is on GitHub if you want to play around with it.

Station           | Actual Time | Route    | Direction | Simulated Time
Riverside         | 07:15       | Green-D  | East      | 07:15
Kenmore           |             | Green-B  | West      | 07:48
Boston College    | 08:18       | WALK     | WALK      | 08:15
Cleveland Circle  | 08:41       | Green-C  | East      | 08:25
Park St           | 09:12       | Green-E  | West      | 09:07
Heath St          | 09:45       | 39       | Outbound  | 09:42
Forest Hills      | 09:58       | Orange   | North     | 09:54
Downtown Crossing | 10:21       | Red      | North     | 10:17
Alewife           | 10:50       | Red      | South     | 10:43
Davis             | 10:54       | 96       | Outbound  | 10:46
Medford/Tufts     | 11:04       | Green-E  | West      | 11:07
East Somerville   |             | WALK     | WALK      | 11:16
Union Sq          | 11:24       | Green-D  | West      | 11:21
North Sta         | 11:36       | Orange   | North     | 11:36
Oak Grove         | 12:05       | Orange   | South     | 12:03
Haymarket         |             | WALK     | WALK      | 12:23
Bowdoin           | 12:35       | Blue     | North     | 12:28
Wonderland        | 12:57       | Blue     | South     | 12:50
State             |             | WALK     | WALK      | 13:08
Downtown Crossing | 13:17       | Red-A    | South     | 13:13
Ashmont           | 13:41       | Mattapan | Outbound  | 13:38
Mattapan          | 13:57       | Mattapan | Inbound   | 13:51
Ashmont           | 14:08       | Red-A    | North     | 14:05
JFK/UMass         | 14:23       | Red-B    | South     | 14:20
Braintree         | 14:47       |          |           | 14:45

Not too bad! Especially considering that the "actual time" column is based on Mastodon post timestamps so probably 1-2 minutes behind the actual timing.

Generating the optimal route

My eventual plan is to use this to learn more about what makes an ideal MBTA speedrun. Doing a speedrun has been a bucket list item of mine for a while, and I want to see if I can find something new in terms of route or timing. I think it's unlikely I'll find anything substantial enough to make a difference, but maybe it's possible...

Since running this type of simulation is extremely fast, there are a lot of places you could take this beyond tracing manually entered routes:

  • Taking a graph representation of the MBTA system, running a search algorithm to identify a large set of possible routes, and running simulations on each to estimate real-world timing
  • Simulating a route at various times of day, days of the week, or even months of the year to find the best time to start a run
  • Implementing logic to simulate making decisions at each step of the route based on upcoming train times (e.g., whether an Ashmont or Braintree train arrives first), to evaluate when to make those decisions

Takeaways

I hadn't really done an in-depth data analysis project quite like this before. I definitely learned a ton!

Ingesting data into a data structure like this is something I usually consider "grunt work," but this project showed me how wrong that assumption can be. I tried to throw a lot of the work at an LLM, only to watch it struggle with the date and timezone issues and completely miss the nuance between stop_timestamp and move_timestamp. While I got much further than it did (thankfully, for the sake of my job security), I still had to step away from the problem for a day before I could nail the accuracy.

I had also heard people constantly talk about how timezones are hard and date parsing is a mess, but had been spared the brunt of that struggle until now. I have discovered that there are a lot of bad ways to represent dates and times in software. I feel lucky that I do not need to deal with things like this more often.

The last thing I'll mention is that this project highlighted the difference between working with data in a domain where a dominant format exists (i.e., realtime GTFS), and working in a domain where implementors have no common format to use (i.e., historical transit data). I find it easy to get annoyed at format specifications not perfectly matching what I want to do... but when standardized formats work, they're pretty great!