mapgwm.swflows module
Code for preprocessing streamflow data.
-
mapgwm.swflows.
aggregrate_values_to_stress_periods
(data, perioddata, datetime_col='datetime', values_col='values', id_col='id', category_col='qualifier', keep_columns=None)
Pandas sausage-making to take flow values at arbitrary times and average them to model stress periods defined in a perioddata dataframe. Optionally, a category column identifying flow values as ‘measured’ or ‘estimated’ can also be read. Measured values are used for the averages where available; the numbers of measured and estimated values contributing to each average are tallied in the output.
- Parameters
- dataDataFrame
Input data
- perioddataDataFrame
Stress Period start/end times. Must have columns:
per
int
MODFLOW Stress Period
start_datetime
str or datetime64
Period start time
end_datetime
str or datetime64
Period end time
- datetime_colstr
Column in data for Measurement dates (str or datetime64)
- id_col: int or str
Column in data identifying the hydrography line or measurement site associated with flow value. Valid identifiers corresponding to the source hydrography (e.g. NHDPlus COMIDs) may be needed for locating the flows within the stream network, unless x and y coordinates are available.
- values_colfloat
Column in data with observed or estimated values.
- category_colstr; categorical
Column in data with ‘measured’ or ‘estimated’ flags indicating how each flow value was derived. If None, ‘measured’ is used for all flows. By default, ‘qualifier’.
- site_no_colstr
Optional column in data identifying the measurement site associated with each value, for example, for keeping track of measurement site numbers or names that are different than the hydrography identifiers in line_id_col.
- Returns
- df_perDataFrame
Stress period averages, and metadata describing the source of the averages (number of estimated vs. measured values). For each site, also includes averages and standard deviations for measurements from outside the stress periods defined in perioddata. For sites with no measurements within a stress period, an average of all other measurements is used.
Notes
This method is similar to mfsetup.tdis.aggregate_dataframe_to_stress_period in what it does (resample time series to model stress periods), and for modflow-setup, it is superseded by that method (called via a TransientTabularData object). It is kept here in case it is needed in the future. Key differences between the two methods:
* This method operates on the whole time series of model stress periods instead of a single stress period.
* This method allows for specification of both measured and estimated values; the numbers of estimated and measured values contributing to each average are included in the output.
* Duplicate sites in mfsetup.tdis.aggregate_dataframe_to_stress_period (for example, to handle wells located in the same model cell) can be aggregated with sum, mean, first, etc., or by raising an error. In this method, the duplicate site with the most measurements (vs. estimates) is retained.
* This method fills stress periods without measurements using the mean values for all time.
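The averaging rule described above (measured values take precedence within a stress period, and the contributing values are tallied) can be sketched in plain Python. This is only an illustration, not the pandas-based implementation in mapgwm: the records, the period boundaries, the inclusive treatment of end_datetime, and the output keys are all assumptions, and the gap-filling step for periods without values is omitted.

```python
from datetime import datetime
from statistics import mean

# Hypothetical stress periods: (per, start_datetime, end_datetime).
# Treating end_datetime as inclusive is an assumption of this sketch.
periods = [
    (0, datetime(2017, 10, 1), datetime(2017, 10, 31)),
    (1, datetime(2017, 11, 1), datetime(2017, 11, 30)),
]

# Hypothetical flow records: (id, datetime, value, qualifier)
records = [
    ("07288000", datetime(2017, 10, 5), 10.0, "measured"),
    ("07288000", datetime(2017, 10, 20), 20.0, "measured"),
    ("07288000", datetime(2017, 10, 25), 99.0, "estimated"),
    ("07288000", datetime(2017, 11, 10), 30.0, "estimated"),
]


def average_to_periods(records, periods):
    """Average values by site and stress period, preferring measured values."""
    results = []
    for site in sorted({rec[0] for rec in records}):
        for per, start, end in periods:
            in_period = [r for r in records if r[0] == site and start <= r[1] <= end]
            measured = [r[2] for r in in_period if r[3] == "measured"]
            estimated = [r[2] for r in in_period if r[3] == "estimated"]
            # Measured values take precedence; estimates are only a fallback
            used = measured if measured else estimated
            if not used:
                continue  # gap filling is not reproduced in this sketch
            results.append({
                "id": site,
                "per": per,
                "value": mean(used),
                "category": "measured" if measured else "estimated",
                # Tally only the values that contributed to the average
                "n_measured": len(measured),
                "n_estimated": 0 if measured else len(estimated),
            })
    return results


period_averages = average_to_periods(records, periods)
```

In this sketch the October estimate (99.0) is ignored because two measured values exist for that period, while November falls back to its single estimate.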
-
mapgwm.swflows.
combine_measured_estimated_values
(measured_values, estimated_values, measured_values_data_col, estimated_values_data_col, dest_values_col='obsval', resample_freq='MS', how='mean')
Combine time series of measured and estimated values for multiple sites, giving preference to measured values.
- Parameters
- measured_valuescsv file or DataFrame
Time series of measured values at multiple sites, similar to that output by preprocess_flows(). Columns:
site_no
site identifiers; read-in as strings
datetime
measurement dates/times
data columns
columns with floating-point data to combine
- estimated_valuescsv file or DataFrame
Time series of estimated values at multiple sites, similar to that output by preprocess_flows(). Columns:
site_no
site identifiers; read-in as strings
datetime
measurement dates/times
data columns
columns with floating-point data to combine
- measured_values_data_colstr
Column in measured_values with data to combine.
- estimated_values_data_colstr
Column in estimated_values with data to combine.
- dest_values_colstr
Output column with combined data from measured_values_data_col and estimated_values_data_col, by default ‘obsval’
- resample_freqstr or DateOffset
Any pandas frequency alias. The data columns in measured_values and estimated_values are resampled to this frequency using the method specified by how. By default, ‘MS’ (month-start)
- howstr
Resample method. Can be any of the method calls on the pandas Resampler object. By default, ‘mean’
- Returns
- combinedDataFrame
DataFrame containing all columns from estimated_values, the data columns from measured_values, and a dest_values_col consisting of measured values where present, and estimated values otherwise. An "est_" prefix is added to the estimated data columns, and a "meas_" prefix is added to the measured data columns. Example:
site_no   datetime    category  est_qbase_m3d  meas_qbase_m3d  obsval
07288000  2017-10-01  measured  47872.1        28438.7         28438.7
07288000  2017-11-01  measured  47675.9        24484.5         24484.5
Where category denotes whether the value in obsval is measured or estimated.
Notes
All columns with a floating-point dtype are identified as “Data columns,” and are resampled as specified by the resample_freq and how arguments. For all other columns, the first value for each time at each site is used. The resampled measured data are joined to the resampled estimated data on the basis of site numbers and times (as a pandas MultiIndex).
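The join-and-prefer-measured logic described above can be sketched without pandas, using dictionaries keyed by (site_no, datetime) in place of a MultiIndex. The values below are hypothetical, the data are assumed to be already resampled, and joining on the union of keys is an assumption of this sketch rather than a statement about the actual join type.

```python
# Hypothetical monthly values keyed by (site_no, datetime); not real data
measured = {("07288000", "2017-10-01"): 100.0}
estimated = {
    ("07288000", "2017-10-01"): 120.0,
    ("07288000", "2017-11-01"): 110.0,
}


def combine(measured, estimated, data_col="qbase_m3d", dest_values_col="obsval"):
    """Join measured and estimated values on (site_no, datetime), preferring measured."""
    combined = []
    for site_no, dt in sorted(set(measured) | set(estimated)):
        meas = measured.get((site_no, dt))
        est = estimated.get((site_no, dt))
        combined.append({
            "site_no": site_no,
            "datetime": dt,
            # category records which source ended up in the destination column
            "category": "measured" if meas is not None else "estimated",
            "est_" + data_col: est,
            "meas_" + data_col: meas,
            dest_values_col: meas if meas is not None else est,
        })
    return combined


combined = combine(measured, estimated)
```

Here the October row takes the measured value, and the November row, which has no measurement, falls back to the estimate and is flagged as ‘estimated’.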
-
mapgwm.swflows.
format_usgs_sw_site_id
(stationID)
Add leading zeros to NWIS surface water site numbers, if they are missing. See https://help.waterdata.usgs.gov/faq/sites/do-station-numbers-have-any-particular-meaning. Zeros are only added to numeric site numbers less than 15 characters in length.
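A minimal sketch of this kind of padding is shown below. It is not the mapgwm source: the function name is hypothetical, and the choice to prepend a single zero to any all-digit identifier that does not already start with one is an assumption. Only the "numeric and shorter than 15 characters" condition comes from the description above.

```python
def format_site_id_sketch(station_id):
    """Prepend a zero to a numeric site number that appears to have lost one."""
    station_id = str(station_id)
    # Only all-digit identifiers shorter than 15 characters are padded;
    # prepending exactly one zero is an assumption of this sketch
    if station_id.isdigit() and not station_id.startswith("0") and len(station_id) < 15:
        return "0" + station_id
    return station_id


padded = format_site_id_sketch(7288000)  # e.g. a site number read in as an integer
```

Site numbers that were read in as integers (and so lost their leading zero) are the typical case this handles; 15-digit identifiers and non-numeric names pass through unchanged.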
-
mapgwm.swflows.
preprocess_flows
(data, metadata=None, flow_data_columns=['flow'], start_date=None, active_area=None, active_area_id_column=None, active_area_feature_id=None, source_crs=4269, dest_crs=5070, datetime_col='datetime', site_no_col='site_no', line_id_col='line_id', x_coord_col='x', y_coord_col='y', name_col='name', flow_qualifier_column=None, default_qualifier='measured', include_sites=None, include_line_ids=None, source_volume_units='ft3', source_time_units='s', dest_volume_units='m3', dest_time_units='d', geographic_groups=None, geographic_groups_col=None, max_obsname_len=None, add_leading_zeros_to_sw_site_nos=False, column_renames=None, outfile=None)
Preprocess stream flow observation data, for example, from NWIS or another data source that outputs time series in CSV format with site locations and identifiers.
Data are reprojected from a source_crs (coordinate reference system; assumed to be in geographic coordinates) to the CRS of the model (dest_crs).
Data are culled to a start_date and, optionally, a polygon or set of polygons defining the model area.
Volume and time units are converted to those of the groundwater model.
Prefixes for observation names (with an optional length limit) that identify the location are generated.
Preliminary observation groups can also be assigned, based on geographic areas defined by polygons (geographic_groups parameter).
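As a worked example of the unit conversion step with the default units (‘ft3’ and ‘s’ to ‘m3’ and ‘d’), the arithmetic can be written out as follows. The function name is hypothetical and this is not how mapgwm performs the conversion internally; it only shows the factor involved.

```python
FT_TO_M = 0.3048          # exact, by definition of the international foot
SECONDS_PER_DAY = 86400


def cfs_to_m3d(q_cfs):
    """Convert a flow in cubic feet per second to cubic meters per day."""
    return q_cfs * FT_TO_M ** 3 * SECONDS_PER_DAY


factor = cfs_to_m3d(1.0)  # roughly 2446.58 m3/d per ft3/s
```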
- Parameters
- datacsv file or DataFrame
Time series of stream flow observations. Columns:
site_no
site identifier
datetime
measurement dates/times
x
x-coordinate of site
y
y-coordinate of site
flow_data_columns
Columns of observed streamflow values
flow_qualifier_column
Optional column with qualifiers for flow values
Notes:
x and y columns can alternatively be in the metadata table
The columns with flow data are specified with the flow_data_columns argument; multiple columns can be included to process base flow and total flow, or other statistics in tandem
For example, flow_qualifier_column may have “estimated” or “measured” flags denoting whether streamflows were derived from measured values or statistical estimates.
- metadatacsv file or DataFrame
Stream flow observation site information.
May include columns:
site_no
site identifier
x
x-coordinate of site
y
y-coordinate of site
name
name of site
line_id_col
Identifier for a line in a hydrography dataset that the site is associated with.
Notes:
other columns in metadata will be passed through to the metadata output
- flow_data_columnslist of strings
Columns in data with flow values or their statistics. By default, [‘flow’]
- start_datestr (YYYY-mm-dd)
Simulation start date (cull observations before this date)
- active_areastr
Shapefile with polygon to cull observations to. Automatically reprojected to dest_crs if the shapefile includes a .prj file. by default, None.
- active_area_id_columnstr, optional
Column in active_area with feature ids. By default, None, in which case all features are used.
- active_area_feature_idstr, optional
ID of feature to use for active area. By default, None, in which case all features are used.
- source_crsobj
Coordinate reference system of the streamflow observation locations. A Python int, dict, str, or pyproj.crs.CRS instance passed to pyproj.crs.CRS.from_user_input()
- Can be any of:
PROJ string
Dictionary of PROJ parameters
PROJ keyword arguments for parameters
JSON string with PROJ parameters
CRS WKT string
An authority string [e.g. ‘epsg:4326’]
An EPSG integer code [e.g. 4326]
A tuple of (“auth_name”, “auth_code”) [e.g. (‘epsg’, ‘4326’)]
An object with a to_wkt method.
A pyproj.crs.CRS class
By default, epsg:4269
- dest_crsobj
Coordinate reference system of the model. Same input types as source_crs. By default, epsg:5070
- datetime_colstr, optional
Column name in data with observation date/times, by default ‘datetime’
- site_no_colstr, optional
Column name in data and metadata with site identifiers, by default ‘site_no’
- line_id_colstr, optional
Column name in data or metadata with identifiers for hydrography lines associated with observation sites. by default ‘line_id’
- x_coord_colstr, optional
Column name in data or metadata with x-coordinates, by default ‘x’
- y_coord_colstr, optional
Column name in data or metadata with y-coordinates, by default ‘y’
- name_colstr, optional
Column name in data or metadata with observation site names, by default ‘name’
- flow_qualifier_columnstr, optional
Column name in data with flow observation qualifiers, such as “measured” or “estimated”. By default, None.
- default_qualifierstr, optional
Default qualifier to populate flow_qualifier_column if it is None. By default, “measured”
- include_siteslist-like, optional
Limit output to these sites. By default, None (include all sites)
- include_line_idslist-like, optional
Limit output to these sites, represented by line identifiers. By default, None (include all sites)
- source_volume_unitsstr, ‘m3’, ‘cubic meters’, ‘ft3’, etc.
Volume units of the source data. By default, ‘ft3’
- source_time_unitsstr, ‘s’, ‘seconds’, ‘days’, etc.
Time units of the source data. By default, ‘s’
- dest_volume_unitsstr, ‘m3’, ‘cubic meters’, ‘ft3’, etc.
Volume units of the output (model). By default, ‘m3’
- dest_time_unitsstr, ‘s’, ‘seconds’, ‘days’, etc.
Time units of the output (model). By default, ‘d’
- geographic_groupsfile, dict or list-like
Option to group observations by area(s) of interest. Can be a shapefile, list of shapefiles, or dictionary of shapely polygons. A ‘group’ column will be created in the metadata, and observation sites within each polygon will be assigned the group name associated with that polygon.
For example:
geographic_groups='../source_data/extents/CompositeHydrographArea.shp' geographic_groups=['../source_data/extents/CompositeHydrographArea.shp'] geographic_groups={'cha': <shapely Polygon>}
Where ‘cha’ is an observation group name for observations located within the area defined by CompositeHydrographArea.shp. For shapefiles, group names are provided in a geographic_groups_col.
- geographic_groups_colstr
Field name in the geographic_groups shapefile(s) containing the observation group names associated with each polygon.
- max_obsname_lenint or None
Maximum length for observation name prefix. A value of 13 allows for a PEST obsnme of 20 characters or less with <prefix>_yyyydd or <prefix>_<per>d<per> (e.g. <prefix>_2d1 for a difference between stress periods 2 and 1). If None (the default), observation names will not be truncated. PEST++ does not have a limit on observation name length.
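The length arithmetic behind the value of 13 can be checked with a short sketch: a 13-character prefix, an underscore, and a 6-character suffix give 20 characters. The helper below is hypothetical, and truncating by keeping the first characters of the prefix is an assumption; the source does not say how mapgwm shortens names.

```python
def make_obsname(prefix, suffix, max_obsname_len=None):
    """Build '<prefix>_<suffix>', optionally truncating the prefix."""
    if max_obsname_len is not None:
        # Assumption: keep the leading characters of the prefix
        prefix = prefix[:max_obsname_len]
    return f"{prefix}_{suffix}"


# Hypothetical site prefix and a 6-character date-like suffix
obsname = make_obsname("07288000-long-site-name", "201710", max_obsname_len=13)
```

With max_obsname_len=None the prefix is left intact, which is acceptable for PEST++ since it places no limit on observation name length.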
- add_leading_zeros_to_sw_site_nosbool
Whether or not to pad site numbers using the mapgwm.swflows.format_usgs_sw_site_id() function. By default, False.
- column_renamesdict, optional
Option to rename columns in the data or metadata that are different than those listed above. For example, if the data file has a ‘SITE_NO’ column instead of ‘site_no’:
column_renames={'SITE_NO': 'site_no'}
By default, None, in which case the renames listed above will be used. Note that the renames must be the same as those listed above for mapgwm.swflows.preprocess_flows() to work.
- outfilestr
Where output file will be written. Metadata are written to a file with the same name, with an additional “_info” suffix prior to the file extension.
- Returns
- dataDataFrame
Preprocessed time series
- metadataDataFrame
Preprocessed metadata
References
The PEST++ Manual <https://github.com/usgs/pestpp/tree/master/documentation>
-
mapgwm.swflows.
resample_group_timeseries
(df, resample_freq='MS', how='mean', add_data_prefix=None)
Resample a DataFrame with both groups (e.g. measurement sites) and time series (measurements at each site).
- Parameters
- dfDataFrame
Time series of values at multiple sites, similar to that output by preprocess_flows(). Columns:
site_no
site identifiers; read-in as strings
datetime
measurement dates/times
data columns
columns with floating-point data to resample
- resample_freqstr or DateOffset
Any pandas frequency alias. The data columns in df are resampled to this frequency using the method specified by how. By default, ‘MS’ (month-start)
- howstr
Resample method. Can be any of the method calls on the pandas Resampler object. By default, ‘mean’
- add_data_prefixstr
Option to add prefix to data columns. By default, None
- Returns
- resampledDataFrame
Resampled data at each site.
Notes
All columns with a floating-point dtype are identified as “Data columns,” and are resampled as specified by the resample_freq and how arguments. For all other columns, the first value for each time at each site is used.
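The rule in these notes (floating-point columns are aggregated, all other columns take the first value in each site and time bin) can be illustrated with a plain-Python sketch for the default ‘MS’ frequency and ‘mean’ method. The rows and the function name are hypothetical, and concatenating add_data_prefix directly onto the column name is an assumption; the actual function works on a pandas DataFrame.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical input rows, one per measurement
rows = [
    {"site_no": "07288000", "datetime": date(2017, 10, 3), "category": "measured", "q": 10.0},
    {"site_no": "07288000", "datetime": date(2017, 10, 17), "category": "estimated", "q": 30.0},
    {"site_no": "07288000", "datetime": date(2017, 11, 2), "category": "measured", "q": 5.0},
]


def resample_month_start_mean(rows, add_data_prefix=None):
    """Group rows by site and month start; average floats, keep first of the rest."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["site_no"], row["datetime"].replace(day=1))].append(row)
    resampled = []
    for site_no, month_start in sorted(groups):
        members = groups[(site_no, month_start)]
        record = {}
        for col, first_value in members[0].items():
            if isinstance(first_value, float):
                # Floating-point values are treated as data columns
                name = f"{add_data_prefix}{col}" if add_data_prefix else col
                record[name] = mean(m[col] for m in members)
            else:
                # All other columns take the first value in the group
                record[col] = first_value
        record["datetime"] = month_start  # label each bin by its month start
        resampled.append(record)
    return resampled


resampled = resample_month_start_mean(rows, add_data_prefix="meas_")
```

Note that the October bin keeps ‘measured’ as its category because that is the first value in the group, even though one of the two averaged rows was flagged ‘estimated’.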