mapgwm.swflows module

Code for preprocessing streamflow data.

mapgwm.swflows.aggregrate_values_to_stress_periods(data, perioddata, datetime_col='datetime', values_col='values', id_col='id', category_col='qualifier', keep_columns=None)[source]

Take flow values at arbitrary times and average them to the model stress periods defined in a perioddata DataFrame. Optionally, a category column identifying flow values as ‘measured’ or ‘estimated’ can also be read. Measured values are used for the averages where available; the numbers of measured and estimated values contributing to each average are tallied in the output.

Parameters
dataDataFrame

Input data

perioddataDataFrame

Stress Period start/end times. Must have columns:

per

int

MODFLOW Stress Period

start_datetime

str or datetime64

Period start time

end_datetime

str or datetime64

Period end time

datetime_colstr

Column in data for Measurement dates (str or datetime64)

id_col: int or str

Column in data identifying the hydrography line or measurement site associated with flow value. Valid identifiers corresponding to the source hydrography (e.g. NHDPlus COMIDs) may be needed for locating the flows within the stream network, unless x and y coordinates are available.

values_colfloat

Column in data with observed or estimated values.

category_colstr; categorical

Column in data with ‘measured’ or ‘estimated’ flags indicating how each flow value was derived. If None, ‘measured’ is used for all flows. By default, ‘qualifier’.

site_no_colstr

Optional column in data identifying the measurement site associated with each value, for example, for keeping track of measurement site numbers or names that are different from the hydrography identifiers in id_col.

Returns
df_perDataFrame

Stress period averages, and metadata describing the source of the averages (number of estimated vs. measured values). For each site, also includes averages and standard deviations for measurements from outside the stress periods defined in perioddata. For sites with no measurements within a stress period, an average of all other measurements is used.

Notes

This method is similar to mfsetup.tdis.aggregate_dataframe_to_stress_period in that it resamples time series to model stress periods, and for modflow-setup it is superseded by that method (called via the TransientTabularData object). It is kept here in case it is needed in the future. Key differences between the two methods:

  • this method operates on the whole time series of model stress periods instead of a single stress period

  • this method allows for specification of both measured and estimated values; the numbers of estimated and measured values contributing to each average are included in the output

  • duplicate sites in mfsetup.tdis.aggregate_dataframe_to_stress_period (for example, to handle wells located in the same model cell) can be aggregated with sum, mean, first, etc., or by raising an error. In this method, the duplicate site with the most measurements (vs. estimates) is retained.

  • this method fills stress periods without measurements using the mean values for all time.
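The averaging described for this function can be illustrated with a minimal pandas sketch. It is not the library's implementation: the input values are made up, the output column names (mean_value, n_measured, n_estimated) are hypothetical, and treating end_datetime as inclusive is an assumption.

```python
import pandas as pd

# Hypothetical inputs laid out as the parameters above describe.
data = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "datetime": pd.to_datetime(
        ["2020-01-10", "2020-01-20", "2020-02-05", "2020-02-15"]),
    "values": [10.0, 30.0, 50.0, 70.0],
    "qualifier": ["measured", "estimated", "estimated", "estimated"],
})
perioddata = pd.DataFrame({
    "per": [0, 1],
    "start_datetime": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "end_datetime": pd.to_datetime(["2020-01-31", "2020-02-29"]),
})

rows = []
for period in perioddata.itertuples():
    # Assumption: both period bounds are inclusive.
    in_period = data[(data["datetime"] >= period.start_datetime)
                     & (data["datetime"] <= period.end_datetime)]
    for site, group in in_period.groupby("id"):
        measured = group[group["qualifier"] == "measured"]
        # Prefer measured values; use estimates only if none were measured.
        used = measured if len(measured) > 0 else group
        rows.append({
            "per": period.per,
            "id": site,
            "mean_value": used["values"].mean(),
            "n_measured": len(measured),
            "n_estimated": len(group) - len(measured),
        })

df_per = pd.DataFrame(rows)
print(df_per)
```

In this sketch, period 0 averages only the single measured value, while period 1 has no measurements and falls back to the mean of its two estimates.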

mapgwm.swflows.combine_measured_estimated_values(measured_values, estimated_values, measured_values_data_col, estimated_values_data_col, dest_values_col='obsval', resample_freq='MS', how='mean')[source]

Combine time series of measured and estimated values for multiple sites, giving preference to measured values.

Parameters
measured_valuescsv file or DataFrame

Time series of measured values at multiple sites, similar to that output by preprocess_flows().

Columns:

site_no

site identifiers; read-in as strings

datetime

measurement dates/times

data columns

columns with floating-point data to combine

estimated_valuescsv file or DataFrame

Time series of estimated values at multiple sites, similar to that output by preprocess_flows().

Columns:

site_no

site identifiers; read-in as strings

datetime

measurement dates/times

data columns

columns with floating-point data to combine

measured_values_data_colstr

Column in measured_values with data to combine.

estimated_values_data_colstr

Column in estimated_values with data to combine.

dest_values_colstr

Output column with combined data from measured_values_data_col and estimated_values_data_col, by default ‘obsval’

resample_freqstr or DateOffset

Any pandas frequency alias. The data columns in measured_values and estimated_values are resampled to this frequency using the method specified by how. By default, ‘MS’ (month-start)

howstr

Resample method. Can be any of the method calls on the pandas Resampler object. By default, ‘mean’

Returns
combinedDataFrame

DataFrame containing all columns from estimated_values, the data columns from measured_values, and a dest_values_col consisting of measured values where present, and estimated values otherwise. An "est_" prefix is added to the estimated data columns, and a "meas_" prefix is added to the measured data columns.

Example:

site_no   datetime    category  est_qbase_m3d  meas_qbase_m3d  obsval
07288000  2017-10-01  measured  47872.1        28438.7         28438.7
07288000  2017-11-01  measured  47675.9        24484.5         24484.5

Where category denotes whether the value in obsval is measured or estimated.

Notes

All columns with a floating-point dtype are identified as “Data columns,” and are resampled as specified by the resample_freq and how arguments. For all other columns, the first value for each time at each site is used. The resampled measured data are joined to the resampled estimated data on the basis of site numbers and times (as a pandas MultiIndex).
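The resample-then-join logic in these notes might look roughly like the following pandas sketch. It is an illustration under stated assumptions, not the function's actual code: the flow values are invented, and the helper monthly_mean is hypothetical.

```python
import pandas as pd

# Hypothetical measured and estimated series for one site.
measured = pd.DataFrame({
    "site_no": ["07288000", "07288000"],
    "datetime": pd.to_datetime(["2017-10-03", "2017-10-17"]),
    "qbase_m3d": [28000.0, 29000.0],
})
estimated = pd.DataFrame({
    "site_no": ["07288000"] * 3,
    "datetime": pd.to_datetime(["2017-10-01", "2017-11-01", "2017-12-01"]),
    "qbase_m3d": [47000.0, 46000.0, 45000.0],
})


def monthly_mean(df, prefix):
    """Resample one data column to month-start means for each site."""
    grouped = df.groupby(["site_no", pd.Grouper(key="datetime", freq="MS")])
    return grouped[["qbase_m3d"]].mean().add_prefix(prefix)


meas = monthly_mean(measured, "meas_")
est = monthly_mean(estimated, "est_")

# Join on the (site_no, datetime) MultiIndex, keeping every estimated row.
combined = est.join(meas, how="left")

# Measured values take precedence; estimates fill the remaining times.
combined["obsval"] = combined["meas_qbase_m3d"].fillna(
    combined["est_qbase_m3d"])
combined["category"] = combined["meas_qbase_m3d"].notna().map(
    {True: "measured", False: "estimated"})
print(combined)
```

Only October has measurements here, so obsval takes the measured monthly mean for October and the estimated values for November and December.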

mapgwm.swflows.format_site_ids(iterable, add_leading_zeros=False)[source]

Cast site ids to strings

mapgwm.swflows.format_usgs_sw_site_id(stationID)[source]

Add leading zeros to NWIS surface water sites, if they are missing. See https://help.waterdata.usgs.gov/faq/sites/do-station-numbers-have-any-particular-meaning. Zeros are only added to numeric site numbers less than 15 characters in length.
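As a rough illustration of this kind of padding, the sketch below prepends a single zero to numeric identifiers shorter than 15 characters that do not already start with one. The exact rule the library applies is not spelled out in this documentation, so the helper name pad_site_id and the single-zero rule are both assumptions.

```python
def pad_site_id(station_id):
    """Hypothetical helper: restore a leading zero lost to integer casting."""
    text = str(station_id)
    # Assumption: pad only all-digit IDs under 15 characters that lack a zero.
    if text.isdigit() and not text.startswith("0") and len(text) < 15:
        return "0" + text
    return text


print(pad_site_id(7288000))            # read in as an integer, zero restored
print(pad_site_id("07288000"))         # already padded, left unchanged
print(pad_site_id("323730090271101"))  # 15 characters, left unchanged
```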

mapgwm.swflows.preprocess_flows(data, metadata=None, flow_data_columns=['flow'], start_date=None, active_area=None, active_area_id_column=None, active_area_feature_id=None, source_crs=4269, dest_crs=5070, datetime_col='datetime', site_no_col='site_no', line_id_col='line_id', x_coord_col='x', y_coord_col='y', name_col='name', flow_qualifier_column=None, default_qualifier='measured', include_sites=None, include_line_ids=None, source_volume_units='ft3', source_time_units='s', dest_volume_units='m3', dest_time_units='d', geographic_groups=None, geographic_groups_col=None, max_obsname_len=None, add_leading_zeros_to_sw_site_nos=False, column_renames=None, outfile=None)[source]

Preprocess stream flow observation data, for example, from NWIS or another data source that outputs time series in CSV format with site locations and identifiers.

  • Data are reprojected from a source_crs (Coordinate reference system; assumed to be in geographic coordinates) to the CRS of the model (dest_crs)

  • Data are culled to a start_date and optionally, a polygon or set of polygons defining the model area

  • Volume and time units are converted to those of the groundwater model.

  • Prefixes for observation names (with an optional length limit) that identify the location are generated

  • Preliminary observation groups can also be assigned, based on geographic areas defined by polygons (geographic_groups parameter)

Parameters
datacsv file or DataFrame

Time series of stream flow observations. Columns:

site_no

site identifier

datetime

measurement dates/times

x

x-coordinate of site

y

y-coordinate of site

flow_data_columns

Columns of observed streamflow values

flow_qualifier_column

Optional column with qualifiers for flow values

Notes:

  • x and y columns can alternatively be in the metadata table

  • Columns with flow values are specified with the flow_data_columns argument; multiple columns can be included to process base flow and total flow, or other statistics in tandem

  • For example, flow_qualifier_column may have “estimated” or “measured” flags denoting whether streamflows were derived from measured values or statistical estimates.

metadatacsv file or DataFrame

Stream flow observation site information.

May include columns:

site_no

site identifier

x

x-coordinate of site

y

y-coordinate of site

name

name of site

line_id_col

Identifier for a line in a hydrography dataset that the site is associated with.

Notes:

  • other columns in metadata will be passed through to the metadata output

flow_data_columnslist of strings

Columns in data with flow values or their statistics. By default, [‘flow’]

start_datestr (YYYY-mm-dd)

Simulation start date (cull observations before this date)

active_areastr

Shapefile with polygon to cull observations to. Automatically reprojected to dest_crs if the shapefile includes a .prj file. by default, None.

active_area_id_columnstr, optional

Column in active_area with feature ids. By default, None, in which case all features are used.

active_area_feature_idstr, optional

ID of feature to use for active area. By default, None, in which case all features are used.

source_crsobj

Coordinate reference system of the flow observation locations. A Python int, dict, str, or pyproj.crs.CRS instance passed to pyproj.crs.CRS.from_user_input()

Can be any of:
  • PROJ string

  • Dictionary of PROJ parameters

  • PROJ keyword arguments for parameters

  • JSON string with PROJ parameters

  • CRS WKT string

  • An authority string [e.g. ‘epsg:4326’]

  • An EPSG integer code [e.g. 4326]

  • A tuple of (“auth_name”, “auth_code”) [e.g. (‘epsg’, ‘4326’)]

  • An object with a to_wkt method.

  • A pyproj.crs.CRS class

By default, epsg:4269

dest_crsobj

Coordinate reference system of the model. Same input types as source_crs. By default, epsg:5070

datetime_colstr, optional

Column name in data with observation date/times, by default ‘datetime’

site_no_colstr, optional

Column name in data and metadata with site identifiers, by default ‘site_no’

line_id_colstr, optional

Column name in data or metadata with identifiers for hydrography lines associated with observation sites, by default ‘line_id’

x_coord_colstr, optional

Column name in data or metadata with x-coordinates, by default ‘x’

y_coord_colstr, optional

Column name in data or metadata with y-coordinates, by default ‘y’

name_colstr, optional

Column name in data or metadata with observation site names, by default ‘name’

flow_qualifier_columnstr, optional

Column name in data with flow observation qualifiers, such as “measured” or “estimated”. By default, None.

default_qualifierstr, optional

Default qualifier to populate flow_qualifier_column if it is None. By default, “measured”

include_siteslist-like, optional

Limit output to these sites. By default, None (include all sites)

include_line_idslist-like, optional

Limit output to these sites, represented by line identifiers. By default, None (include all sites)

source_volume_unitsstr, ‘m3’, ‘cubic meters’, ‘ft3’, etc.

Volume units of the source data. By default, ‘ft3’

source_time_unitsstr, ‘s’, ‘seconds’, ‘days’, etc.

Time units of the source data. By default, ‘s’

dest_volume_unitsstr, ‘m3’, ‘cubic meters’, ‘ft3’, etc.

Volume units of the output (model). By default, ‘m3’

dest_time_unitsstr, ‘s’, ‘seconds’, ‘days’, etc.

Time units of the output (model). By default, ‘d’
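With the default units, flows are converted from cubic feet per second to cubic meters per day. The arithmetic can be checked with a short sketch; the function name cfs_to_m3d is hypothetical and is not part of mapgwm.

```python
FT3_TO_M3 = 0.3048 ** 3    # one cubic foot in cubic meters (1 ft = 0.3048 m)
SECONDS_PER_DAY = 86400


def cfs_to_m3d(flow_cfs):
    """Convert a flow in ft3/s to m3/d."""
    return flow_cfs * FT3_TO_M3 * SECONDS_PER_DAY


print(round(cfs_to_m3d(1.0), 3))  # 2446.576
```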

geographic_groupsfile, dict or list-like

Option to group observations by area(s) of interest. Can be a shapefile, list of shapefiles, or dictionary of shapely polygons. A ‘group’ column will be created in the metadata, and observation sites within each polygon will be assigned the group name associated with that polygon.

For example:

geographic_groups='../source_data/extents/CompositeHydrographArea.shp'
geographic_groups=['../source_data/extents/CompositeHydrographArea.shp']
geographic_groups={'cha': <shapely Polygon>}

Where ‘cha’ is an observation group name for observations located within the area defined by CompositeHydrographArea.shp. For shapefiles, group names are provided in a geographic_groups_col.

geographic_groups_colstr

Field name in the geographic_groups shapefile(s) containing the observation group names associated with each polygon.

max_obsname_lenint or None

Maximum length for observation name prefix. A value of 13 allows for a PEST obsnme of 20 characters or less with <prefix>_yyyydd or <prefix>_<per>d<per> (e.g. <prefix>_2d1 for a difference between stress periods 2 and 1). If None, observation names will not be truncated. PEST++ does not have a limit on observation name length. By default, None.
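The length arithmetic behind the value of 13 can be sketched as follows: a 13-character prefix, an underscore, and a six-character suffix add up to 20 characters. How the library actually shortens a long prefix is not stated here, so keeping the leading characters is an assumption, and make_obsname and the sample names are hypothetical.

```python
def make_obsname(prefix, suffix, max_obsname_len=13):
    """Hypothetical helper: join a (possibly truncated) prefix and a suffix."""
    if max_obsname_len is not None:
        # Assumption: keep the first max_obsname_len characters of the prefix.
        prefix = prefix[:max_obsname_len]
    return f"{prefix}_{suffix}"


long_name = make_obsname("07288000-sunflower", "201710")
print(long_name, len(long_name))

# A short prefix with a stress-period difference suffix is left as-is.
print(make_obsname("07288000", "2d1"))

# With max_obsname_len=None nothing is truncated.
print(make_obsname("07288000-sunflower", "201710", max_obsname_len=None))
```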

add_leading_zeros_to_sw_site_nosbool

Whether or not to pad site numbers using the mapgwm.swflows.format_usgs_sw_site_id() function. By default, False.

column_renamesdict, optional

Option to rename columns in the data or metadata that are different than those listed above. For example, if the data file has a ‘SITE_NO’ column instead of ‘site_no’:

column_renames={'SITE_NO': 'site_no'}

By default, None, in which case the column names listed above are expected. Note that the new names must match those listed above for mapgwm.swflows.preprocess_flows() to work.
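The effect of such a mapping is the same as a plain pandas column rename, as in this small sketch. The input table and its ‘Q’ column are hypothetical, and the sketch does not call preprocess_flows() itself.

```python
import pandas as pd

# Hypothetical raw table whose column names differ from the expected ones.
raw = pd.DataFrame({"SITE_NO": ["07288000"], "Q": [120.0]})

# Map each source column name to the name the preprocessing expects.
column_renames = {"SITE_NO": "site_no", "Q": "flow"}
renamed = raw.rename(columns=column_renames)
print(list(renamed.columns))
```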

outfilestr

Where output file will be written. Metadata are written to a file with the same name, with an additional “_info” suffix prior to the file extension.
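The naming rule for the metadata file can be illustrated with pathlib. This is a sketch of the rule as described, not the library's own code; the helper info_filename and the example path are hypothetical.

```python
from pathlib import Path


def info_filename(outfile):
    """Insert an "_info" suffix ahead of the file extension."""
    path = Path(outfile)
    return path.with_name(f"{path.stem}_info{path.suffix}")


print(info_filename("processed/flows.csv").name)  # flows_info.csv
```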

Returns
dataDataFrame

Preprocessed time series

metadataDataFrame

Preprocessed metadata

References

The PEST++ Manual <https://github.com/usgs/pestpp/tree/master/documentation>

mapgwm.swflows.resample_group_timeseries(df, resample_freq='MS', how='mean', add_data_prefix=None)[source]

Resample a DataFrame with both groups (e.g. measurement sites) and time series (measurements at each site).

Parameters
dfDataFrame

Time series of values at multiple sites, similar to that output by preprocess_flows().

Columns:

site_no

site identifiers; read-in as strings

datetime

measurement dates/times

data columns

columns with floating-point data to resample

resample_freqstr or DateOffset

Any pandas frequency alias. The data columns in df are resampled to this frequency using the method specified by how. By default, ‘MS’ (month-start)

howstr

Resample method. Can be any of the method calls on the pandas Resampler object. By default, ‘mean’

add_data_prefixstr

Option to add prefix to data columns. By default, None

Returns
resampledDataFrame

Resampled data at each site.

Notes

All columns with a floating-point dtype are identified as “Data columns,” and are resampled as specified by the resample_freq and how arguments. For all other columns, the first value for each time at each site is used.
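The behavior described in these notes can be approximated with a grouped aggregation in pandas. The sketch below is an assumption about one way to do it, not the function's actual code; the sample data and the "meas_" prefix are made up for illustration.

```python
import pandas as pd

# Hypothetical input: two sites, one float data column, one text column.
df = pd.DataFrame({
    "site_no": ["A", "A", "A", "B"],
    "datetime": pd.to_datetime(
        ["2020-01-05", "2020-01-25", "2020-02-10", "2020-01-15"]),
    "flow": [1.0, 3.0, 5.0, 10.0],
    "name": ["Alpha", "Alpha", "Alpha", "Beta"],
})

# Floating-point columns are treated as data columns.
keys = ["site_no", "datetime"]
float_cols = [c for c in df.columns if pd.api.types.is_float_dtype(df[c])]
other_cols = [c for c in df.columns if c not in float_cols + keys]

# Data columns get the resample method; other columns keep the first value.
agg = {c: "mean" for c in float_cols}
agg.update({c: "first" for c in other_cols})

grouped = df.groupby(["site_no", pd.Grouper(key="datetime", freq="MS")])
resampled = grouped.agg(agg)

# Optional prefix on the data columns only.
resampled = resampled.rename(columns={c: "meas_" + c for c in float_cols})
print(resampled)
```

Site A's two January values are averaged into one month-start row, while the name column simply carries the first value seen in each month.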