

To install the thermostat package for the first time, we highly recommend that you create a virtual environment or a conda environment in which to install it. You may choose to skip this step, but doing so risks corrupting your existing Python environment. Isolating your Python environment will also make it easier to debug.

# if using virtualenvwrapper (see https://virtualenvwrapper.readthedocs.org/en/latest/install.html)
$ mkvirtualenv thermostat
(thermostat)$ pip install thermostat

# if using conda (see note below - conda is distributed with Anaconda)
$ conda create --yes --name thermostat pandas
(thermostat)$ pip install thermostat

If you already have an environment, use the following:

# if using virtualenvwrapper
$ workon thermostat

# if using conda
$ source activate thermostat

To deactivate the environment when you’ve finished, use the following:

# if using virtualenvwrapper
(thermostat)$ deactivate

# if using conda
(thermostat)$ source deactivate

Check to make sure you are on the most recent version of the package.

>>> import thermostat; thermostat.get_version()

If you are not on the correct version, you should upgrade:

$ pip install thermostat --upgrade

The command above will update dependencies as well. If you wish to skip this, use the --no-deps flag:

$ pip install thermostat --upgrade --no-deps

Previous versions of the package are available on GitHub.


If you experience issues installing Python packages with C extensions, such as numpy or scipy, we recommend installing and using the free Anaconda Python distribution by Continuum Analytics. It contains many of the numeric and scientific packages used by this package and has installers for Python 2.7 and 3.5 for Windows, Mac OS X and Linux.

Once you have verified a correct installation, import the necessary methods and set a directory for finding and storing data.


If you suspect a package version conflict or error, you can verify the versions of the packages you have installed against the package versions in thermostatreqnotes.txt.

To list your package versions, use:

$ pip freeze

or (if you’re using Anaconda):

$ conda list
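The same check can be made from inside Python. The following is a rough sketch that assumes Python 3.8 or later (for importlib.metadata); the helper name installed_versions and the package names in the example call are illustrative and are not taken from thermostatreqnotes.txt.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative package names; substitute the ones you need to check.
for name, version in installed_versions(["thermostat", "eemeter", "pandas"]).items():
    print(name, version or "not installed")
```

Packages reported as not installed in the active environment are a common sign that the wrong environment is activated.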

Script setup and imports

Import the few Python standard-library modules and functions used in this tutorial as follows.

import sys
import os
import warnings
from os.path import expanduser

Also make sure to import the methods we will be using from the thermostat package.

from thermostat.importers import from_csv
from thermostat.exporters import metrics_to_csv
from thermostat.stats import compute_summary_statistics
from thermostat.stats import summary_statistics_to_csv

Set the data_dir variable as a convenience. We will refer to this directory and save our results in it. You should also move all downloaded and extracted files used in this tutorial into this directory before using them. You may, of course, choose to use a different directory, which you can set here, or override it entirely by replacing it where it appears in the tutorial.

data_dir = os.path.join(expanduser("~"), "thermostat_tutorial")
# or data_dir = "/full/path/to/custom/directory/"
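Because the tutorial saves its results into this directory, it may help to create it up front. A minimal sketch, assuming Python 3.2 or later for the exist_ok argument; ensure_data_dir is a hypothetical helper name, not part of the thermostat package.

```python
import os


def ensure_data_dir(path):
    """Create path (and any missing parents) if needed; return its absolute form."""
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return os.path.abspath(path)
```

For example, calling data_dir = ensure_data_dir(data_dir) right after the assignment above leaves data_dir pointing at a directory that is known to exist.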

Optional Setup

To follow the progress of downloading and caching external weather files, which is the most time-consuming part of this tutorial, you may wish to configure logging at this point. The example here works within most IPython or script environments. If you have a more complicated logging setup, you may need to use something other than the root logger, which this example uses.

import logging
logger = logging.getLogger()
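The two lines above only obtain the root logger. One plausible way to finish the configuration, offered as a sketch that uses only the standard logging module, is shown here. The INFO level and the message format are assumptions, not settings required by the package.

```python
import logging
import sys

logger = logging.getLogger()  # root logger, as in the snippet above
logger.setLevel(logging.INFO)

# Send records to stdout so they appear in IPython sessions and script output.
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)

logger.info("logging configured")
```

Running this more than once in the same session adds a second handler and duplicates each message, so it is best run once per session.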


The thermostat package depends on the eemeter package for weather data fetching. The eemeter package automatically creates its own cache directory in which it keeps cached versions of weather source data. This speeds up the (generally I/O bound) NOAA weather fetching routine on subsequent internal calls to fetch the same weather data (i.e. getting outdoor temperature data for thermostats that map to the same weather station).

For more information, see the eemeter package.


US Census Bureau ZIP Code Tabulation Areas (ZCTAs) are used to map USPS ZIP codes to outdoor temperature data. If the automatic mapping is unsuccessful for one or more of the ZIP codes in your dataset, the likely reason is a discrepancy between “true” USPS ZIP codes and US Census Bureau ZCTAs. “True” ZIP codes are not used because they do not always map well to a location (for example, ZIP codes for P.O. boxes). You may need to map ZIP codes to ZCTAs first; otherwise, the affected thermostats will be skipped. There are roughly 32,000 ZCTAs and roughly 42,000 ZIP codes, so there are many fewer ZCTAs than ZIP codes.
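A pre-check along these lines can flag the ZIP codes that need remapping before the data is loaded. This is a sketch only: the function name, the set of known ZCTAs and the ZIP-to-ZCTA crosswalk are hypothetical inputs that you would need to supply yourself, and the codes in the usage example are placeholders rather than real mappings.

```python
def remap_zipcodes(zipcodes, known_zctas, zip_to_zcta):
    """Split ZIP codes into usable ZCTAs and codes that cannot be mapped.

    zipcodes    -- iterable of 5-digit ZIP code strings from the metadata file
    known_zctas -- set of valid ZCTA strings (user supplied)
    zip_to_zcta -- dict mapping non-ZCTA ZIP codes to a ZCTA (user supplied)
    """
    mapped, unmapped = {}, []
    for zipcode in zipcodes:
        if zipcode in known_zctas:
            mapped[zipcode] = zipcode  # already a ZCTA
        elif zip_to_zcta.get(zipcode) in known_zctas:
            mapped[zipcode] = zip_to_zcta[zipcode]  # e.g. a P.O. box ZIP code
        else:
            unmapped.append(zipcode)  # no usable ZCTA found
    return mapped, unmapped


# Placeholder codes for illustration only.
mapped, unmapped = remap_zipcodes(
    ["11111", "22222", "33333"],
    known_zctas={"11111", "44444"},
    zip_to_zcta={"22222": "44444"},
)
print(mapped)    # {'11111': '11111', '22222': '44444'}
print(unmapped)  # ['33333']
```

The codes returned in unmapped are the ones whose thermostats would be skipped unless a ZCTA is supplied for them.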

Computing individual thermostat-season metrics

After importing the package methods, load the example thermostat data, or provide data of your own. See Input data for more detailed file format information.

Fabricated example data from 35 thermostats in various climate zones is available for download here.

Loading the thermostat data below will take more than a few minutes, even if the weather cache is enabled (see note above). This is because loading thermostat data involves downloading hourly weather data from a remote source - in this case, the NCDC.

The following loads a lazy iterator over the thermostats. The thermostats will be loaded into memory as necessary in the following steps.

metadata_filename = os.path.join(data_dir, "examples/metadata.csv")
thermostats = from_csv(metadata_filename, verbose=True)

To calculate savings metrics, iterate through thermostats and save the results. Uncomment the commented lines if you would like to store the thermostats in memory for inspection. Note that this could eat up your application memory and is only recommended for debugging purposes.

metrics = []
# saved_thermostats = []
for thermostat in thermostats:
    outputs = thermostat.calculate_epa_field_savings_metrics()
    metrics.extend(outputs)  # collect the results so they can be exported below
    # saved_thermostats.append(thermostat)

The single-thermostat metrics should be output to CSV and converted to dataframe format.

output_filename = os.path.join(data_dir, "thermostat_example_output.csv")
metrics_df = metrics_to_csv(metrics, output_filename)

The output CSV will be saved in your data directory and should very nearly match the output CSV provided in the example data.

See Output data for more detailed file format information.

Computing summary statistics

Once you have obtained output for each individual thermostat in your dataset, use the stats module to compute summary statistics, which are formatted for submission to the EPA. The example below works with the output file from the tutorial above and can be modified to use your data.

Compute statistics across all thermostats.

with warnings.catch_warnings():
    # Silence warnings raised inside this block (assumed intent of catch_warnings here).
    warnings.simplefilter("ignore")

    # uses the metrics_df created in the Quickstart above.
    stats = compute_summary_statistics(metrics_df)

    # If you want to have advanced filter outputs, use this instead
    # stats_advanced = compute_summary_statistics(metrics_df, advanced_filtering=True)

Save these results to file.

Each row of the saved CSV will represent one type of output, with one row per statistic per output. Each column in the CSV will represent one subset of thermostats, as determined by grouping by EIC climate zone and applying various filtering methods. National weighted averages will be available near the top of the file.

At this point, you will also need to provide an alphanumeric product identifier for the connected thermostat; e.g. a combination of the connected thermostat service plus one or more connected thermostat device models that comprises the data set.

product_id = "INSERT ALPHANUMERIC PRODUCT ID HERE"  # placeholder; replace with your identifier
stats_filepath = os.path.join(data_dir, "thermostat_example_stats.csv")
stats_df = summary_statistics_to_csv(stats, stats_filepath, product_id)

# or with advanced filter outputs
# stats_advanced_filepath = os.path.join(data_dir, "thermostat_example_stats_advanced.csv")
# stats_advanced_df = summary_statistics_to_csv(stats_advanced, stats_advanced_filepath, product_id)

National savings are computed by weighted average of percent savings results grouped by climate zone. Heavier weights are applied to results in climate zones which, regionally, tend to have longer runtimes. Weightings used are available for download.
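The weighting step can be illustrated with a small sketch. The climate zone labels, savings values and weights below are made up for illustration; the actual weights are the ones in the downloadable weightings file, and the package's own implementation may differ in detail.

```python
def national_weighted_mean(savings_by_zone, weights_by_zone):
    """Weighted average of per-zone values, using zones present in both inputs."""
    zones = [zone for zone in savings_by_zone if zone in weights_by_zone]
    total_weight = sum(weights_by_zone[zone] for zone in zones)
    if total_weight == 0:
        raise ValueError("no overlapping climate zones with non-zero weight")
    weighted_sum = sum(savings_by_zone[zone] * weights_by_zone[zone] for zone in zones)
    return weighted_sum / total_weight


# Hypothetical zone labels, mean percent savings and weights.
savings = {"zone_a": 10.0, "zone_b": 20.0}
weights = {"zone_a": 1.0, "zone_b": 3.0}
print(national_weighted_mean(savings, weights))  # 17.5
```

Here zone_b carries three times the weight of zone_a, so the result (17.5) sits closer to its value than the unweighted mean (15.0) would.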

More information

For additional information on package usage, please see the API documentation.

Input data

Input data should be specified using the following formats. One CSV should specify thermostat summary metadata (e.g. unique identifiers, location, etc.). Another CSV (or CSVs) should contain runtime information, linked to the metadata CSV by the thermostat_id column.

Example files here.

Thermostat Summary Metadata CSV format


| Name | Data Format | Units | Description |
| --- | --- | --- | --- |
| thermostat_id | string | N/A | A uniquely identifying marker for the thermostat. |
| equipment_type | enum, {0..5} | N/A | The type of controlled HVAC heating and cooling equipment. [1] |
| zipcode | string, 5 digits | N/A | The ZIP code in which the thermostat is installed [2]. |
| utc_offset | string | N/A | The UTC offset of the times in the corresponding interval data CSV. (e.g. “-0700”) |
| interval_data_filename | string | N/A | The filename of the interval data file corresponding to this thermostat. Should be specified relative to the location of the metadata file. |

  • Each row should correspond to a single thermostat.
  • Nulls should be specified by leaving the field blank.
  • All interval data for a particular thermostat should use the same, single UTC offset provided in the metadata file.
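A quick format check on a metadata file might look like the sketch below. It uses only the standard library and the column names from the table above; the function name check_metadata and the sample rows are hypothetical, and the ±HHMM shape for utc_offset is an assumption based on the “-0700” example.

```python
import csv
import io
import re

# Column -> (pattern, message). The +/-HHMM offset shape is an assumption
# based on the "-0700" example in the table.
CHECKS = {
    "equipment_type": (re.compile(r"[0-5]"), "must be an integer from 0 to 5"),
    "zipcode": (re.compile(r"[0-9]{5}"), "must be exactly 5 digits"),
    "utc_offset": (re.compile(r"[+-][0-9]{4}"), "must look like -0700"),
}
REQUIRED_COLUMNS = ["thermostat_id", "interval_data_filename"] + list(CHECKS)


def check_metadata(csv_file):
    """Return a list of (line number, problem) tuples for a metadata CSV."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]
    problems = []
    for line_number, row in enumerate(reader, start=2):
        for column, (pattern, message) in CHECKS.items():
            value = row[column]
            # Blank fields are nulls in this format, so they are skipped here.
            if value and not pattern.fullmatch(value):
                problems.append((line_number, "%s %s" % (column, message)))
    return problems


# Made-up sample rows: the first is well formed, the second is not.
sample = io.StringIO(
    "thermostat_id,equipment_type,zipcode,utc_offset,interval_data_filename\n"
    "t1,1,12345,-0700,t1.csv\n"
    "t2,9,1234,0700,t2.csv\n"
)
for line_number, problem in check_metadata(sample):
    print(line_number, problem)
```

To check a real file, pass an open file object instead of the in-memory sample, for example open(metadata_filename, newline="").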

Thermostat Interval Data CSV format


| Name | Data Format | Units | Description |
| --- | --- | --- | --- |
| thermostat_id | string | N/A | Uniquely identifying marker for the thermostat. |
| date | YYYY-MM-DD (ISO-8601) | N/A | Date of this set of readings. |
| cool_runtime | decimal or integer | minutes | Daily runtime of cooling equipment. |
| heat_runtime | decimal or integer | minutes | Daily runtime of heating equipment. [3] |
| auxiliary_heat_HH | decimal or integer | minutes | Hourly runtime of auxiliary heat equipment (HH=00-23). |
| emergency_heat_HH | decimal or integer | minutes | Hourly runtime of emergency heat equipment (HH=00-23). |
| temp_in_HH | decimal, to nearest 0.5 | °F | Hourly average conditioned space temperature over the period of the reading (HH=00-23). |
| heating_setpoint_HH | decimal, to nearest 0.5 | °F | Hourly average thermostat heating setpoint temperature over the period of the reading (HH=00-23). |
| cooling_setpoint_HH | decimal, to nearest 0.5 | °F | Hourly average thermostat cooling setpoint temperature over the period of the reading (HH=00-23). |

  • Each row should correspond to a single daily reading from a thermostat.
  • Nulls should be specified by leaving the field blank.
  • Zero values should be specified as 0, rather than as blank.
  • If data is missing for a particular row of one column, data should still be provided for other columns in that row. For example, if runtime is missing for a particular date, please still provide indoor conditioned space temperature and setpoints for that date, if available.
  • Runtimes should be less than or equal to 1440 min (1 day).
  • Dates should be specified in the ISO 8601 date format (e.g. 2015-05-19).
  • All temperatures should be specified in °F (to the nearest 0.5°F).
  • If no distinction is made between heating and cooling setpoint, set both equal to the single setpoint.
  • All runtime data MUST have the same UTC offset, as provided in the corresponding metadata file.
  • If only a single setpoint is used for the thermostat, please copy the same setpoint data into both the heating and cooling setpoint columns.
  • Outdoor temperature data need not be provided; it will be fetched automatically from NCDC using the eemeter package.
  • Dates should be consecutive.
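Some of the rules listed above can be checked mechanically before the data is loaded. The sketch below expands the HH placeholder into the 24 hourly column names and checks the consecutive-date and 1440-minute rules. The function names are hypothetical, the column order is an assumption, and date.fromisoformat requires Python 3.7 or later.

```python
from datetime import date, timedelta

HOURS = ["%02d" % hour for hour in range(24)]
HOURLY_PREFIXES = [
    "auxiliary_heat",
    "emergency_heat",
    "temp_in",
    "heating_setpoint",
    "cooling_setpoint",
]


def expected_columns():
    """Column names implied by the table above, with HH expanded to 00-23."""
    columns = ["thermostat_id", "date", "cool_runtime", "heat_runtime"]
    for prefix in HOURLY_PREFIXES:
        columns.extend("%s_%s" % (prefix, hour) for hour in HOURS)
    return columns


def check_daily_rows(rows):
    """Return a list of problems found in a sequence of row dicts (string values)."""
    problems = []
    previous = None
    for row in rows:
        current = date.fromisoformat(row["date"])  # raises ValueError if malformed
        if previous is not None and current - previous != timedelta(days=1):
            problems.append("%s: dates are not consecutive" % row["date"])
        previous = current
        for column in ("cool_runtime", "heat_runtime"):
            value = row.get(column)
            # Blank values are nulls and are skipped; "0" is a real zero.
            if value and not 0 <= float(value) <= 1440:
                problems.append("%s: %s outside 0-1440 minutes" % (row["date"], column))
    return problems


# Made-up rows: the second skips a day and exceeds the daily runtime limit.
rows = [
    {"date": "2015-05-19", "cool_runtime": "120", "heat_runtime": ""},
    {"date": "2015-05-21", "cool_runtime": "1500", "heat_runtime": "0"},
]
print(len(expected_columns()))  # 124
print(check_daily_rows(rows))
```

The rows can come from csv.DictReader, which yields one dict of strings per line of the interval data file.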

[1] Options for equipment_type:

  • 0: Other – e.g. multi-zone multi-stage, modulating. Note: module will not output savings data for this type.
  • 1: Single stage heat pump with electric resistance aux and/or emergency heat (i.e., strip heat)
  • 2: Single stage heat pump without additional and/or supplemental heating sources (excludes aux/emergency heat as well as dual fuel systems, i.e., heat pump plus gas- or oil-fired furnace)
  • 3: Single stage non heat pump with single-stage central air conditioning
  • 4: Single stage non heat pump without central air conditioning
  • 5: Single stage central air conditioning without central heating
[2] Will be used for matching with a weather station that provides external dry-bulb temperature data. This temperature data will be used to determine the bounds of the heating and cooling season over which metrics will be computed. For more information on the mapping between ZIP codes and weather stations, please see eemeter.weather.location.
[3] Should not include runtime for auxiliary or emergency heat - this should be provided separately in the columns emergency_heat_HH and auxiliary_heat_HH.

Output data

Individual thermostat-season

The following columns are an intermediate output generated for each thermostat-season.


| Name | Data Format | Units | Description |
| --- | --- | --- | --- |
| General outputs | | | |
| sw_version | string | N/A | Software version. |
| ct_identifier | string | N/A | Identifier for thermostat as provided in the metadata file. |
| equipment_type | enum {0..5} | N/A | Equipment type of this thermostat (1, 2, 3, 4, or 5). |
| heating_or_cooling | string | N/A | Label for the core day set (e.g. ‘heating_2012-2013’). |
| zipcode | string, 5 digits | N/A | ZIP code provided in the metadata file. |
| station | string, USAF ID | N/A | USAF identifier for station used to fetch hourly temperature data. |
| climate_zone | string | N/A | EIC climate zone (consolidated). |
| start_date | date | ISO-8601 | Earliest date in input file. |
| end_date | date | ISO-8601 | Latest date in input file. |
| n_days_both_heating_and_cooling | integer | # days | Number of days not included as core days due to presence of both heating and cooling. |
| n_days_insufficient_data | integer | # days | Number of days not included as core days due to missing data. |
| n_core_cooling_days | integer | # days | Number of days meeting criteria for inclusion in core cooling day set. |
| n_core_heating_days | integer | # days | Number of days meeting criteria for inclusion in core heating day set. |
| n_days_in_inputfile_date_range | integer | # days | Number of potential days in inputfile date range. |
| baseline10_core_cooling_comfort_temperature | float | °F | Baseline comfort temperature as determined by 10th percentile of indoor temperatures. |
| baseline90_core_cooling_comfort_temperature | float | °F | Baseline comfort temperature as determined by 90th percentile of indoor temperatures. |
| regional_average_baseline_cooling_comfort_temperature | float | °F | Baseline comfort temperature as determined by regional average. |
| regional_average_baseline_heating_comfort_temperature | float | °F | Baseline comfort temperature as determined by regional average. |
| Model outputs | | | |
| percent_savings_baseline_percentile | float | percent | Percent savings as given by hourly average CTD or HTD method with 10th or 90th percentile baseline |
| avoided_daily_mean_core_day_runtime_baseline_percentile | float | minutes | Avoided average daily runtime for core cooling days |
| avoided_total_core_day_runtime_baseline_percentile | float | minutes | Avoided total runtime for core cooling days |
| baseline_daily_mean_core_day_runtime_baseline_percentile | float | minutes | Baseline average daily runtime for core cooling days |
| baseline_total_core_day_runtime_baseline_percentile | float | minutes | Baseline total runtime for core cooling days |
| percent_savings_baseline_regional | float | percent | Percent savings as given by hourly average CTD or HTD method with 10th or 90th percentile regional baseline |
| avoided_daily_mean_core_day_runtime_baseline_regional | float | minutes | Avoided average daily runtime for core cooling days |
| avoided_total_core_day_runtime_baseline_regional | float | minutes | Avoided total runtime for core cooling days |
| baseline_daily_mean_core_day_runtime_baseline_regional | float | minutes | Baseline average daily runtime for core cooling days |
| baseline_total_core_day_runtime_baseline_regional | float | minutes | Baseline total runtime for core cooling days |
| mean_demand | float | °F | Average cooling demand |
| alpha | float | minutes/Δ°F | The fitted slope of cooling runtime to demand regression |
| tau | float | °F | The fitted intercept of cooling runtime to demand regression |
| mean_sq_err | float | N/A | Mean squared error of regression |
| root_mean_sq_err | float | N/A | Root mean squared error of regression |
| cv_root_mean_sq_err | float | N/A | Coefficient of variation of root mean squared error of regression |
| mean_abs_err | float | N/A | Mean absolute error |
| mean_abs_pct_err | float | N/A | Mean absolute percent error |
| Runtime outputs | | | |
| total_core_cooling_runtime | float | minutes | Total core cooling equipment runtime |
| total_core_heating_runtime | float | minutes | Total core heating equipment runtime |
| total_auxiliary_heating_core_day_runtime | float | minutes | Total core auxiliary heating equipment runtime |
| total_emergency_heating_core_day_runtime | float | minutes | Total core emergency heating equipment runtime |
| daily_mean_core_cooling_runtime | float | minutes | Average daily core cooling runtime |
| daily_mean_core_heating_runtime | float | minutes | Average daily core heating runtime |
| Resistance heat outputs | | | |
| rhu_00F_to_05F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(0 \leq T_{out} < 5\) |
| rhu_05F_to_10F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(5 \leq T_{out} < 10\) |
| rhu_10F_to_15F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(10 \leq T_{out} < 15\) |
| rhu_15F_to_20F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(15 \leq T_{out} < 20\) |
| rhu_20F_to_25F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(20 \leq T_{out} < 25\) |
| rhu_25F_to_30F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(25 \leq T_{out} < 30\) |
| rhu_30F_to_35F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(30 \leq T_{out} < 35\) |
| rhu_35F_to_40F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(35 \leq T_{out} < 40\) |
| rhu_40F_to_45F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(40 \leq T_{out} < 45\) |
| rhu_45F_to_50F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(45 \leq T_{out} < 50\) |
| rhu_50F_to_55F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(50 \leq T_{out} < 55\) |
| rhu_55F_to_60F | decimal | 0.0=0%, 1.0=100% | Resistance heat utilization for hourly temperature bin \(55 \leq T_{out} < 60\) |
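The resistance heat column names follow a regular pattern of twelve half-open 5 °F bins from 0 °F to 60 °F. The sketch below shows that naming scheme and how an outdoor temperature would map to a column; both helper names are hypothetical and are not part of the package API.

```python
def rhu_column_names(start=0, stop=60, width=5):
    """Build names such as rhu_00F_to_05F for consecutive temperature bins."""
    return [
        "rhu_%02dF_to_%02dF" % (low, low + width)
        for low in range(start, stop, width)
    ]


def rhu_column_for(t_out, start=0, stop=60, width=5):
    """Return the column whose bin low <= t_out < high holds t_out, else None."""
    if not start <= t_out < stop:
        return None
    low = start + int((t_out - start) // width) * width
    return "rhu_%02dF_to_%02dF" % (low, low + width)


names = rhu_column_names()
print(len(names))            # 12
print(names[0], names[-1])   # rhu_00F_to_05F rhu_55F_to_60F
print(rhu_column_for(5.0))   # rhu_05F_to_10F
print(rhu_column_for(60.0))  # None
```

A temperature that falls exactly on a bin edge belongs to the bin that starts at that edge, matching the inequalities in the table.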

Summary Statistics

For each real- or integer-valued column (“###”) from the individual thermostat-season output, the following summary statistics are generated.

(For readability, these statistics are written as rows of the output file rather than as columns.)


Name Description
###_n Number of samples
###_upper_bound_95_perc_conf 95% confidence upper bound on mean value
###_mean Mean value
###_lower_bound_95_perc_conf 95% confidence lower bound on mean value
###_sem Standard error of the mean
###_10q 1st decile (10th percentile, q=quantile)
###_20q 2nd decile
###_30q 3rd decile
###_40q 4th decile
###_50q 5th decile
###_60q 6th decile
###_70q 7th decile
###_80q 8th decile
###_90q 9th decile
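To make the naming concrete, the sketch below computes these statistics for one column using only the standard library (Python 3.8 or later for statistics.quantiles). It is not the package's implementation: the normal-approximation confidence bounds (mean ± 1.96 × SEM), the sample standard deviation and the "inclusive" quantile method are all assumptions, and summarize is a hypothetical helper name.

```python
import math
import statistics


def summarize(values, name):
    """Summary statistics for one numeric column, keyed by the names in the table above."""
    n = len(values)  # at least two values are needed for the standard deviation
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample standard deviation
    summary = {
        name + "_n": n,
        # Assumed normal approximation for the 95% confidence bounds.
        name + "_upper_bound_95_perc_conf": mean + 1.96 * sem,
        name + "_mean": mean,
        name + "_lower_bound_95_perc_conf": mean - 1.96 * sem,
        name + "_sem": sem,
    }
    # Nine cut points that divide the sorted data into ten groups.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    for i, value in enumerate(deciles, start=1):
        summary["%s_%dq" % (name, 10 * i)] = value
    return summary


# Made-up values for a hypothetical column called "example".
result = summarize(list(range(1, 12)), "example")
print(result["example_n"])     # 11
print(result["example_mean"])  # 6
print(result["example_50q"])   # 6.0
```

Other quantile conventions give slightly different deciles for small samples, so values produced this way need not match the package output exactly.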

The following general columns are also output:


Name Description
sw_version Software version
product_id Alphanumeric product identifier
n_thermostat_core_day_sets_total Number of relevant rows from thermostat module output before filtering
n_thermostat_core_day_sets_kept Number of relevant rows from thermostat module not filtered out
n_thermostat_core_day_sets_discarded Number of relevant rows from thermostat module filtered out

The following national weighted percent savings columns are also available.

National savings are computed by weighted average of percent savings results grouped by climate zone. Heavier weights are applied to results in climate zones which, regionally, tend to have longer runtimes. Weightings used are available for download.


Name Description
percent_savings_baseline_percentile_mean_national_weighted_mean National weighted mean percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q10_national_weighted_mean National weighted 10th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q20_national_weighted_mean National weighted 20th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q30_national_weighted_mean National weighted 30th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q40_national_weighted_mean National weighted 40th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q50_national_weighted_mean National weighted 50th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q60_national_weighted_mean National weighted 60th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q70_national_weighted_mean National weighted 70th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q80_national_weighted_mean National weighted 80th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_q90_national_weighted_mean National weighted 90th percentile percent savings as given by baseline_percentile method.
percent_savings_baseline_percentile_lower_bound_95_perc_conf_national_weighted_mean National weighted mean percent savings lower bound as given by a 95% confidence interval and the baseline_percentile method.
percent_savings_baseline_percentile_upper_bound_95_perc_conf_national_weighted_mean National weighted mean percent savings upper bound as given by a 95% confidence interval and the baseline_percentile method.
percent_savings_baseline_regional_mean_national_weighted_mean National weighted mean percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q10_national_weighted_mean National weighted 10th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q20_national_weighted_mean National weighted 20th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q30_national_weighted_mean National weighted 30th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q40_national_weighted_mean National weighted 40th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q50_national_weighted_mean National weighted 50th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q60_national_weighted_mean National weighted 60th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q70_national_weighted_mean National weighted 70th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q80_national_weighted_mean National weighted 80th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_q90_national_weighted_mean National weighted 90th percentile percent savings as given by baseline_regional method.
percent_savings_baseline_regional_lower_bound_95_perc_conf_national_weighted_mean National weighted mean percent savings lower bound as given by a 95% confidence interval and the baseline_regional method.
percent_savings_baseline_regional_upper_bound_95_perc_conf_national_weighted_mean National weighted mean percent savings upper bound as given by a 95% confidence interval and the baseline_regional method.