enb package

Subpackages

Submodules

enb.aanalysis module

Automatic analysis and reporting of pandas.DataFrame instances (e.g., produced by enb.experiment.Experiment instances) using pyplot.

See https://miguelinux314.github.io/experiment-notebook/analyzing_data.html for detailed help.

class enb.aanalysis.Analyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: ATable

Base class for all enb analyzers.

A pandas.DataFrame instance with analysis results can be obtained by calling get_df. In addition, if render_plots is used in that function, one or more figures will be produced. What plots are generated (if any) is based on the values of the self.selected_render_modes list, which must contain only elements in self.valid_render_modes.

Data analysis is done through a surrogate enb.aanalysis.AnalyzerSummary subclass, which is used to obtain the returned analysis results. Subclasses of enb.aanalysis.Analyzer then perform any requested plotting.

Rendering is performed for all modes contained in self.selected_render_modes, all of which must be in self.valid_render_modes.

The @enb.config.aini.managed_attributes decorator overwrites the class (“static”) properties upon definition, with values taken from .ini configuration files. The decorator can be added to any Analyzer subclass, and parameters can be managed under the fully qualified name of the class, e.g., using an “[enb.aanalysis.Analyzer]” section header in any of the .ini files detected by enb.
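For instance, a hypothetical .ini snippet (the file name is illustrative; any .ini file detected by enb is read) might overwrite some of the class attributes listed below. Whether each attribute is managed for a given subclass depends on that class’ definition:

# analyzer_config.ini (hypothetical file name)
[enb.aanalysis.Analyzer]
fig_width = 5.0
fig_height = 4.0
show_grid = True
legend_column_count = 2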

__init__(csv_support_path=None, column_to_properties=None, progress_report_period=None)
Parameters:
  • index – string with column name or list of column names that will be used for indexing. Indices provided to self.get_df must be either one instance (when a single column name is given) or a list of as many instances as elements are contained in self.index. See self.indices.

  • csv_support_path – path to a file where this ATable's contents are to be stored and retrieved. If None, persistence is disabled.

  • column_to_properties – if not None, it is a mapping from strings to callables that defines the columns of the table and how to obtain the cell values

  • progress_report_period – if not None, it must be a positive number of seconds that are waited between progress report messages (if applicable).

classmethod adjust_common_row_axes(column_kwargs, column_selection, render_mode, summary_df)

When self.common_group_scale is True, this method is called to make all groups (rows) use the same scale.

build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Build a enb.aanalysis.AnalyzerSummary instance with the appropriate columns to perform the intended analysis. See enb.aanalysis.AnalyzerSummary for documentation on the meaning of each argument.

Parameters:
  • full_df – dataframe instance being analyzed

  • target_columns – list of columns specified for analysis

  • reference_group – if not None, the column or group to be used as baseline in the analysis

  • include_all_group – force inclusion of an “All” group with all samples

  • render_kwargs – additional keyword arguments passed to get_df for adjusting the rendering process. They can be used, but are not typically needed, for implementing the summary table’s methods. See enb.aanalysis.get_df for details.

Returns:

the built summary table, without having called its get_df method.

column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

combine_groups = False
common_group_scale = True
fig_height = 4.0
fig_width = 5.0
get_df(full_df, target_columns, selected_render_modes=None, output_plot_dir=None, group_by=None, reference_group=None, column_to_properties=None, show_global=None, show_count=None, **render_kwargs)

Analyze a pandas.DataFrame instance, optionally producing plots, and returning the computed dataframe with the analysis results.

Rendering is performed for all modes contained in self.selected_render_modes, all of which must be in self.valid_render_modes. You can pass additional rendering parameters in render_kwargs, which are in turn forwarded to enb.render.render_plds_by_group.

You can use the @enb.aanalysis.Analyzer.normalize_parameters decorator when overwriting this method, to automatically transform None values into their defaults.

Parameters:
  • full_df – full DataFrame instance with data to be plotted and/or analyzed.

  • target_columns – columns to be analyzed. Typically, a list of column names, although each subclass may redefine the accepted format (e.g., pairs of column names). If None, all scalar, non-string columns are used.

  • selected_render_modes – a potentially empty list of mode names, all of which must be in self.valid_render_modes. Each mode represents a type of analysis or plotting. A single string can also be passed.

  • group_by – if not None, the name of the column to be used for grouping.

  • reference_group – if not None, the reference group name against which data are to be analyzed.

  • output_plot_dir – path of the directory where the plot/plots is/are to be saved. If None, the default output plots path given by enb.config.options is used.

  • column_to_properties – dictionary with ColumnProperties entries. ATable instances provide it in the column_to_properties attribute, Experiment instances can also use the joined_column_to_properties attribute to obtain both the dataset and experiment’s columns.

  • show_global – if True, a group containing all elements is also included in the analysis. If None, self.show_global is used.

  • show_count – if True or False, it determines whether the number of elements in each group is shown next to its name. If None, self.show_count is used.

  • render_kwargs – additional rendering parameters, which are in turn forwarded to enb.render.render_plds_by_group.

Returns:

a pandas.DataFrame instance with analysis results
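A minimal usage sketch with a concrete subclass follows; the dataframe and its column names are hypothetical:

import pandas as pd

import enb

# Hypothetical results: one numeric column and one categorical grouping column.
results_df = pd.DataFrame({
    "compression_time_seconds": [0.5, 0.7, 1.2, 0.9],
    "task_label": ["A", "A", "B", "B"]})

analyzer = enb.aanalysis.ScalarNumericAnalyzer()
analysis_df = analyzer.get_df(
    full_df=results_df,
    target_columns=["compression_time_seconds"],
    group_by="task_label",
    selected_render_modes={"histogram"},
    output_plot_dir="./plots")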

get_output_pdf_path(column_selection, group_by, reference_group, output_plot_dir, render_mode)

Get the path of the PDF file to be created for a single parameter selection.

get_render_column_name(column_selection, selected_render_mode)

Return the canonical name for columns containing plottable data instances.

global_group_name = 'All'
global_y_label_margin = 15
grid_alpha = 0.6
group_row_margin = None
horizontal_margin = 0
latex_decimal_count = 3
legend_column_count = 2
legend_position = 'title'
main_alpha = 0.5
main_line_width = 2
main_marker_size = 4
classmethod normalize_parameters(fun, group_by, column_to_properties, target_columns, reference_group, output_plot_dir, selected_render_modes)

Optional decorator for methods compatible with the Analyzer.get_df signature, so that managed attributes are used when the corresponding arguments are passed as None.

This way, users may overwrite most adjustable arguments programmatically, or via .ini configuration files.

plot_title = None
render_all_modes(summary_df, target_columns, output_plot_dir, reference_group, group_by, column_to_properties, selected_render_modes, show_global, show_count, **render_kwargs)

Render all target modes and columns into output_plot_dir, with file names based on self’s class name, the target column and the target render mode.

Subclasses may overwrite their update_render_kwargs_one_case method to customize the rendering parameters that are passed to the parallel rendering function from enb.plotdata. These overwriting methods are encouraged to call enb.aanalysis.Analyzer.update_render_kwargs_one_case (directly or indirectly) to make sure all necessary parameters reach the rendering function.

save_analysis_tables(group_by, reference_group, selected_render_modes, summary_df, summary_table, target_columns, column_to_properties, **render_kwargs)

Save csv and tex files into enb.config.options.analysis_dir that summarize the results of one target_columns element. If enb.config.options.analysis_dir is None or empty, no analysis is performed.

By default, the CSV contains the min, max, avg, std, and median of each group. Subclasses may overwrite this behavior.

secondary_alpha = 0.3
secondary_line_width = 1
secondary_marker_size = 2
selected_render_modes = {}
semilog_y_min_bound = 1e-05
show_count = True
show_global = False
show_grid = False
show_legend = True
show_reference_group = True
show_subgrid = False
show_x_std = False
show_y_std = False
style_list = None
subgrid_alpha = 0.4
tick_direction = 'in'
title_y = None
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

update_render_kwargs_reference_group(column_kwargs, reference_group)

Update the default render kwargs dict when a reference group is selected.

valid_render_modes = {}
vertical_margin = 0
class enb.aanalysis.AnalyzerSummary(analyzer, full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Bases: SummaryTable

Base class for the surrogate, dynamic summary tables employed by enb.aanalysis.Analyzer subclasses to gather analysis results and plottable data (when configured to do so).

__init__(analyzer, full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Dynamically generate the needed analysis columns and any other needed attributes for the analysis.

Columns that generate plottable data are automatically defined using self.render_target, based on the analyzer’s selected render modes.

Plot rendering columns are added automatically via this call, with self.render_target as the associated function and the column_selection and render_mode parameters partialed in.

Subclasses are encouraged to call self.move_render_columns_back() to make sure rendering columns are processed after any other intermediate column defined by the subclass.

Parameters:
  • analyzer – enb.aanalysis.Analyzer subclass instance corresponding to this summary table.

  • full_df – full dataframe specified for analysis.

  • target_columns – columns for which an analysis is being requested.

  • reference_group – if not None, it must be the name of one group, which is used as baseline. Different subclasses may implement this in different ways.

  • group_by – grouping configuration for this summary. See the specific subclass help for more information.

  • include_all_group – if True, an “All” group with all input samples is included in the analysis.

add_render_columns()

Add to column_to_properties the list of columns used to compute instances from the plotdata module. Returns the list of column names added in this call.

apply_reference_bias()

By default, no action is performed relative to the presence of a reference group, and no bias is introduced in the dataframe. Subclasses may overwrite this behavior.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function (after “partialing” out “column_selection” and “render_mode”) that computes the list of plotdata.PlottableData instances representing one group, one target column and one render mode.

Subclasses must implement this method.

Parameters:
  • args – render configuration arguments, expected to contain values for the signature (self, group_label, row).

  • kwargs – dict with at least the “column_selection” and “render_mode” parameters.

move_render_columns_back()

Reorder the column definitions so that rendering columns are attempted after any column the subclass may have defined.

remove_nans(column_series)

Remove the infinite and NaN values from a pd.Series instance.

split_groups(reference_df=None, include_all_group=None)

Split the reference_df pandas.DataFrame into an iterable of (label, dataframe) tuples. This splitting is performed based on the value of self.group_by:

  • If it is None, a single group labelled “all” is created, associated to reference_df.

  • If it is not None:
    • It can be a pandas.DataFrame column index, e.g., a column name or a list of column names. In this case, the result of pandas’ groupby is returned.

    • It can be a callable with a single argument reference_df. In this case, the result of calling that method with reference_df as argument is returned by the call to split_groups().

Subclasses can easily implement custom grouping methods, which must adhere to the following constraints:

  • They must return an iterable of (group_label, group_df) tuples.

  • Unique group_label values must be returned.

Also note that:

  • It is NOT required that the union of all group_df dataframes yields reference_df.

  • It is NOT required that the intersection of any two group_df elements is empty.

  • The group_df dataframes normally contain all columns in reference_df, but it is NOT mandatory to maintain this behavior.

Parameters:
  • reference_df – if not None, a reference dataframe to split. If None, self.reference_df is employed instead.

  • include_all_group – if True, an “All” group is added, containing all input samples, regardless of the groups produced based on group_by. If None, self’s class is queried for that attribute.

Returns:

an iterable of (label, dataframe) tuples.
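For illustration, a hypothetical callable passed as group_by might split rows based on a numeric column; per the constraints above, it returns an iterable of (group_label, group_df) tuples with unique labels (the column name is hypothetical):

def split_by_speed(reference_df):
    # Hypothetical grouping callable: two groups based on a numeric column.
    fast_df = reference_df[reference_df["compression_time_seconds"] <= 1]
    slow_df = reference_df[reference_df["compression_time_seconds"] > 1]
    return [("fast", fast_df), ("slow", slow_df)]

# The callable can then be passed as the group_by argument, e.g.,
# analyzer.get_df(full_df=results_df, target_columns=[...], group_by=split_by_speed)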

class enb.aanalysis.DictNumericAnalyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None, combine_keys_callable=None)

Bases: Analyzer

Analyzer for columns with associated ColumnProperties having has_dict_values=True. Dictionaries are expected to have numeric entries.

__init__(csv_support_path=None, column_to_properties=None, progress_report_period=None, combine_keys_callable=None)
Parameters:
  • csv_support_path – support path where results are to be stored. If None, results are not automatically made persistent.

  • column_to_properties – dictionary mapping column names to their properties

  • progress_report_period – period with which progress reports are emitted by the parallel computation of the analysis table.

  • combine_keys_callable – if not None, it must be a callable that takes a dictionary with numeric values and returns another one with possibly different keys (e.g., merging several dict keys into one).

build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Build a enb.aanalysis.AnalyzerSummary instance with the appropriate columns to perform the intended analysis. See enb.aanalysis.AnalyzerSummary for documentation on the meaning of each argument.

Parameters:
  • full_df – dataframe instance being analyzed

  • target_columns – list of columns specified for analysis

  • reference_group – if not None, the column or group to be used as baseline in the analysis

  • include_all_group – force inclusion of an “All” group with all samples

  • render_kwargs – additional keyword arguments passed to get_df for adjusting the rendering process. They can be used, but are not typically needed, for implementing the summary table’s methods. See enb.aanalysis.get_df for details.

Returns:

the built summary table, without having called its get_df method.

column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

combine_groups = False
common_group_scale = True
get_df(full_df, target_columns, selected_render_modes=None, output_plot_dir=None, group_by=None, reference_group=None, column_to_properties=None, show_global=None, show_count=True, key_to_x=None, **render_kwargs)

Analyze and plot columns containing dictionaries with numeric data.

Parameters:
  • full_df – full DataFrame instance with data to be plotted and/or analyzed.

  • target_columns – columns to be analyzed. Typically, a list of column names, although each subclass may redefine the accepted format (e.g., pairs of column names). If None, all scalar, non-string columns are used.

  • selected_render_modes – a potentially empty list of mode names, all of which must be in self.valid_render_modes. Each mode represents a type of analysis or plotting.

  • group_by – if not None, the name of the column to be used for grouping.

  • reference_group – if not None, the reference group name against which data are to be analyzed.

  • output_plot_dir – path of the directory where the plot/plots is/are to be saved. If None, the default output plots path given by enb.config.options is used.

  • column_to_properties – dictionary with ColumnProperties entries. ATable instances provide it in the column_to_properties attribute, Experiment instances can also use the joined_column_to_properties attribute to obtain both the dataset and experiment’s columns.

  • show_global – if True, a group containing all elements is also included in the analysis

  • key_to_x – if provided, it can be a mapping from the keys found in the column data dictionaries to the x values at which they should be plotted.

Returns:

a pandas.DataFrame instance with analysis results
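A minimal sketch follows, with hypothetical column names and data, for a column whose cells are numeric-valued dictionaries:

import pandas as pd

import enb

# Hypothetical data: each cell of "psnr_by_quality" maps a quality level to a PSNR value.
df = pd.DataFrame({
    "psnr_by_quality": [{1: 30.5, 2: 35.1}, {1: 29.8, 2: 34.7}],
    "corpus": ["kodak", "kodak"]})

# Declare the column as having dict values so it is handled as such.
column_to_properties = {
    "psnr_by_quality": enb.atable.ColumnProperties(
        "psnr_by_quality", label="PSNR by quality", has_dict_values=True)}

analyzer = enb.aanalysis.DictNumericAnalyzer()
analysis_df = analyzer.get_df(
    full_df=df,
    target_columns=["psnr_by_quality"],
    group_by="corpus",
    column_to_properties=column_to_properties,
    output_plot_dir="./plots")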

main_alpha = 0.5
main_line_width = 2
main_marker_size = 4
plot_title = None
secondary_alpha = 0.3
secondary_line_width = 1
secondary_marker_size = 2
selected_render_modes = {'line'}
semilog_y_min_bound = 1e-05
show_count = True
show_global = False
show_individual_samples = True
show_legend = True
show_x_std = False
show_y_std = True
title_y = None
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

valid_render_modes = {'line'}
class enb.aanalysis.DictNumericSummary(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Bases: AnalyzerSummary

Summary table for the DictNumericAnalyzer.

__init__(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Dynamically generate the needed analysis columns and any other needed attributes for the analysis.

Columns that generate plottable data are automatically defined using self.render_target, based on the analyzer’s selected render modes.

Plot rendering columns are added automatically via this call, with self.render_target as the associated function and the column_selection and render_mode parameters partialed in.

Subclasses are encouraged to call self.move_render_columns_back() to make sure rendering columns are processed after any other intermediate column defined by the subclass.

Parameters:
  • analyzer – enb.aanalysis.Analyzer subclass instance corresponding to this summary table.

  • full_df – full dataframe specified for analysis.

  • target_columns – columns for which an analysis is being requested.

  • reference_group – if not None, it must be the name of one group, which is used as baseline. Different subclasses may implement this in different ways.

  • group_by – grouping configuration for this summary. See the specific subclass help for more information.

  • include_all_group – if True, an “All” group with all input samples is included in the analysis.

add_group_description_columns()

Add several columns that compute basic (scalar) numerical stats including min, max, avg, standard deviation and median.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

combine_keys(*args, **kwargs)

Combine the keys of a column

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function that computes a list of enb.plotdata.PlottableData elements for this case (group, column, render_mode).

See enb.aanalysis.AnalyzerSummary.compute_plottable_data_one_case for additional information.

set_group_description(*args, **kwargs)

Set the columns that compute basic (scalar) numerical stats including min, max, avg, standard deviation and median.

class enb.aanalysis.HistogramKeyBinner(min_value, max_value, bin_count, normalize=False)

Bases: object

Helper class to transform numeric-to-numeric dicts into other dicts, binning keys like a histogram (see the sketch after the constructor parameters below).

__init__(min_value, max_value, bin_count, normalize=False)
Parameters:
  • min_value – minimum expected key value

  • max_value – maximum expected key value

  • bin_count – number of bins in the histogram

  • normalize – if True, the relative frequencies are computed instead of the absolute frequencies
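A hedged sketch combining this binner with DictNumericAnalyzer, assuming HistogramKeyBinner instances are callable on numeric-to-numeric dicts as their description suggests:

import enb

# Bin dictionary keys expected in [0, 255] into 32 bins, using relative frequencies.
binner = enb.aanalysis.HistogramKeyBinner(
    min_value=0, max_value=255, bin_count=32, normalize=True)

# Passed as combine_keys_callable, the binner transforms each cell's dict
# before analysis (see DictNumericAnalyzer.__init__ above).
analyzer = enb.aanalysis.DictNumericAnalyzer(combine_keys_callable=binner)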

class enb.aanalysis.ScalarNumeric2DAnalyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: ScalarNumericAnalyzer

Analyzer able to process scalar numeric values located on an (x, y) plane. The target_columns parameter must be an iterable of tuples with 3 elements, containing the columns with the x, y and data coordinates, e.g., (“column_x”, “column_y”, “column_data”).

Each of the three columns must contain scalar numerical data and is treated as such.

In order to do something similar but with categorical (string) x and y axes (e.g., to display a table of numerical means indexed by corpus and task label), please see the ScalarNumericJointAnalyzer class.
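A minimal sketch with hypothetical column names and data, showing the 3-element tuple format of target_columns:

import pandas as pd

import enb

# Hypothetical data: x, y and data columns, all scalar and numeric.
results_df = pd.DataFrame({
    "block_width": [8, 8, 16, 16],
    "block_height": [8, 16, 8, 16],
    "psnr": [30.2, 31.5, 32.0, 33.1]})

analyzer = enb.aanalysis.ScalarNumeric2DAnalyzer()
analysis_df = analyzer.get_df(
    full_df=results_df,
    target_columns=[("block_width", "block_height", "psnr")],
    output_plot_dir="./plots")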

bad_data_color = 'magenta'
bin_count = 50
build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Dynamically build a SummaryTable instance for scalar value analysis.

color_map = 'inferno'
column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

no_data_color = (1, 1, 1, 0)
selected_render_modes = {'colormap'}
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

valid_render_modes = {'colormap'}
x_tick_format_str = '{:.2f}'
y_tick_format_str = '{:.2f}'
class enb.aanalysis.ScalarNumeric2DSummary(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Bases: ScalarNumericSummary

Compute the numerical summaries for a ScalarNumeric2DAnalyzer.

__init__(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Dynamically generate the needed analysis columns and any other needed attributes for the analysis.

Columns that generate plottable data are automatically defined using self.render_target, based on the analyzer’s selected render modes.

Plot rendering columns are added automatically via this call, with self.render_target as the associated function and the column_selection and render_mode parameters partialed in.

Subclasses are encouraged to call self.move_render_columns_back() to make sure rendering columns are processed after any other intermediate column defined by the subclass.

Parameters:
  • analyzer – enb.aanalysis.Analyzer subclass instance corresponding to this summary table.

  • full_df – full dataframe specified for analysis.

  • target_columns – columns for which an analysis is being requested.

  • reference_group – if not None, it must be the name of one group, which is used as baseline. Different subclasses may implement this in different ways.

  • group_by – grouping configuration for this summary. See the specific subclass help for more information.

  • include_all_group – if True, an “All” group with all input samples is included in the analysis.

apply_reference_bias()

Compute the average value of the reference group for each target column and subtract it from the dataframe being analyzed. If no reference group is present, no action is performed.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_colormap_plottable_one_case(*args, **kwargs)

Compute the list of enb.plotdata.PlottableData elements for a single render mode.

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function that computes a list of enb.plotdata.PlottableData elements for this case (group, column, render_mode).

See enb.aanalysis.AnalyzerSummary.compute_plottable_data_one_case for additional information.

class enb.aanalysis.ScalarNumericAnalyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: Analyzer

Analyzer subclass for scalar columns with numeric values.

bar_width_fraction = 1
build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Dynamically build a SummaryTable instance for scalar value analysis.

column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

combine_groups = False
common_group_scale = True
histogram_bin_count = 50
main_alpha = 0.5
main_line_width = 2
main_marker_size = 4
plot_title = None
secondary_alpha = 0.3
secondary_line_width = 1
secondary_marker_size = 2
selected_render_modes = {'boxplot', 'hbar', 'histogram'}
semilog_y_min_bound = 1e-05
show_count = True
show_global = False
show_individual_samples = False
show_legend = True
show_x_std = True
show_y_std = False
sort_by_average = False
title_y = None
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

update_render_kwargs_one_case_boxplot(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update rendering kwargs (e.g., labels) specifically for the boxplot mode.

update_render_kwargs_one_case_hbar(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update rendering kwargs (e.g., labels) specifically for the hbar mode.

update_render_kwargs_one_case_histogram(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update rendering kwargs (e.g., labels) specifically for the histogram mode.

valid_render_modes = {'boxplot', 'hbar', 'histogram'}
class enb.aanalysis.ScalarNumericJointAnalyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: Analyzer

Analyze scalar numeric data, providing joint (simultaneous) grouping by two categories. This is useful to produce tables of averages, e.g., for each corpus/task combination.
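A hedged sketch follows; the column names and data are hypothetical, target_columns elements are assumed to be (x_column, y_column, data_column) tuples (see add_joint_scalar_description_columns below), and x_header_list and highlight_best_column are optional get_df parameters documented further down:

import pandas as pd

import enb

# Hypothetical data: two categorical columns and one numeric data column.
results_df = pd.DataFrame({
    "corpus": ["kodak", "kodak", "cifar", "cifar"],
    "task_label": ["lossless", "lossy", "lossless", "lossy"],
    "compression_ratio": [2.1, 4.5, 1.8, 3.9]})

analyzer = enb.aanalysis.ScalarNumericJointAnalyzer()
analysis_df = analyzer.get_df(
    full_df=results_df,
    target_columns=[("corpus", "task_label", "compression_ratio")],
    x_header_list=["kodak", "cifar"],      # column (x category) order
    highlight_best_column="high",          # highlight the best value in each row
    output_plot_dir="./analysis")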

build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Build a enb.aanalysis.AnalyzerSummary instance with the appropriate columns to perform the intended analysis. See enb.aanalysis.AnalyzerSummary for documentation on the meaning of each argument.

Parameters:
  • full_df – dataframe instance being analyzed

  • target_columns – list of columns specified for analysis

  • reference_group – if not None, the column or group to be used as baseline in the analysis

  • include_all_group – force inclusion of an “All” group with all samples

  • render_kwargs – additional keyword arguments passed to get_df for adjusting the rendering process. They can be used, but are not typically needed, for implementing the summary table’s methods. See enb.aanalysis.get_df for details.

Returns:

the built summary table, without having called its get_df method.

cell_alignment = 'center'
col_header_alignment = 'center'
column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

edges = 'closed'
fig_height = None
fig_width = None
get_df(full_df, target_columns, selected_render_modes=None, output_plot_dir=None, group_by=None, reference_group=None, column_to_properties=None, show_global_row=None, show_global_column=None, show_count=None, x_header_list=None, y_header_list=None, highlight_best_column=None, highlight_best_row=None, **render_kwargs)

Wrapper for enb.aanalysis.Analyzer.get_df, but adding support for the x_header_list and y_header_list parameters to control the row and column order.

Parameters:
  • x_header_list – non-empty list of strings corresponding to the x categories (column headers). All strings in this list must exist as a category. However, not all categories need to be present in this list. If None, all headers are used in alphabetical order. Note that this applies to all elements of target_columns - please invoke this function multiple times if different headers (or ordering) are needed for different elements in target_columns.

  • y_header_list – non-empty list of strings corresponding to the y categories (row headers). All strings in this list must exist as a category. However, not all categories need to be present in this list. If None, all headers are used in alphabetical order. Note that this applies to all elements of target_columns - please invoke this function multiple times if different headers (or ordering) are needed for different elements in target_columns.

  • highlight_best_column – Optionally highlight the best results in each row. Must be one of “low”, “high” or None.

  • highlight_best_row – Optionally highlight the best results in each column. Must be one of “low”, “high” or None

  • show_global_row – if True, or if None and self.show_global is True, an extra row is added to the analysis, corresponding to not splitting the data into the y categories (i.e., considering the average for all possible y categories). Note that if y_header_list is selected, averages are considered only for those categories.

  • show_global_column – like show_global_row, but adds a new column corresponding to not splitting into x categories. Note that if x_header_list is selected, averages are considered only for those categories.

get_filtered_x_y_categories(x_categories, y_categories, reference_group)

Take the list of x and y category values and filter out the reference group if show_reference_group is False.

highlight_best_column = None
highlight_best_row = None
number_format = '{:.3f}'
row_header_alignment = 'left'
save_analysis_tables(group_by, reference_group, selected_render_modes, summary_df, summary_table, target_columns, column_to_properties, **render_kwargs)

Store the joint analysis results in CSV and latex formats.

If enb.config.options.analysis_dir is None or empty, no analysis is performed.

selected_render_modes = {'table'}
should_highlight_cell(summary_dict, summary_table, x_categories, x_category, y_categories, y_category)
show_global_column = False
show_global_row = False
show_reference_group = False
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

valid_render_modes = {'table'}
class enb.aanalysis.ScalarNumericJointSummary(analyzer, full_df, target_columns, reference_group, group_by, show_global_row, show_global_column, x_header_list, y_header_list, highlight_best_row, highlight_best_column)

Bases: ScalarNumericSummary

Summary tables for the ScalarNumericJointAnalyzer class.

__init__(analyzer, full_df, target_columns, reference_group, group_by, show_global_row, show_global_column, x_header_list, y_header_list, highlight_best_row, highlight_best_column)

Identical to enb.aanalysis.ScalarNumericSummary.__init__(), but calculates the scalar numeric description for each x-category/y-category combination (instead of for all samples, like ScalarNumericSummary does). It also allows defining the column and row order with x_header_list, y_header_list.

The following parameters differ from enb.aanalysis.ScalarNumericSummary.__init__():

Parameters:
  • reference_group – if not None, it must be either an x-category value (column header) or a y-category value (row header). If present, the averages of the selected category’s entries are subtracted from the data columns, so that the corresponding column (if an x-category value is selected) or row (if a y-category value is selected) becomes all zeros, and all other entries are relative to the selected category.

  • x_header_list – non-empty list of strings corresponding to the x categories (column headers). All strings in this list must exist as a category. However, not all categories need to be present in this list. If None, all headers are used in alphabetical order.

  • y_header_list – non-empty list of strings corresponding to the y categories (row headers). All strings in this list must exist as a category. However, not all categories need to be present in this list. If None, all headers are used in alphabetical order.

  • highlight_best_row – if not None, it must be “low” or “high”, which determines how the best element in each column is selected and highlighted.

  • highlight_best_column – if not None, it must be “low” or “high”, which determines how the best element in each row is selected and highlighted.

  • show_global_row – if True, an “All” row is added with the average of all rows (assuming at least two rows are present)

  • show_global_column – if True, an “All” column is added with the average of all columns (assuming at least two columns are present)

add_joint_scalar_description_columns(x_column, y_column, data_column)

Add column_functions to this summary table that generate the scalar data description of data_column grouped by (x_column, y_column).

apply_reference_bias()

If applicable, group reference bias is applied when computing the joint scalar descriptions.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function that computes a list of enb.plotdata.PlottableData elements for this case (group, column, render_mode).

See enb.aanalysis.AnalyzerSummary.compute_plottable_data_one_case for additional information.

set_joint_scalar_description(*args, **kwargs)

Set the joint scalar description of a row’s column. For each descriptor (e.g., min or std), a dictionary is stored indexed by the group’s (x_column,y_column) categories.

class enb.aanalysis.ScalarNumericSummary(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Bases: AnalyzerSummary

Summary table used in ScalarNumericAnalyzer, defined dynamically with each call to maintain independent column definitions.

Note that dynamically in this context implies that modifying the returned instance’s class columns does not affect the definition of other instances of this class.

Note that in most cases, the columns returned by default should suffice.

If a reference_group is provided, its average is computed and subtracted from all values when generating the plot.

__init__(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Dynamically generate the needed analysis columns and any other needed attributes for the analysis.

Columns that generate plottable data are automatically defined using self.render_target, based on the analyzer’s selected render modes.

Plot rendering columns are added automatically via this call, with self.render_target as the associated function and the column_selection and render_mode parameters partialed in.

Subclasses are encouraged to call self.move_render_columns_back() to make sure rendering columns are processed after any other intermediate column defined by the subclass.

Parameters:
  • analyzer – enb.aanalysis.Analyzer subclass instance corresponding to this summary table.

  • full_df – full dataframe specified for analysis.

  • target_columns – columns for which an analysis is being requested.

  • reference_group – if not None, it must be the name of one group, which is used as baseline. Different subclasses may implement this in different ways.

  • group_by – grouping configuration for this summary. See the specific subclass help for more information.

  • include_all_group – if True, an “All” group with all input samples is included in the analysis.

add_scalar_description_columns(column_name)

Add the scalar description columns for a given column_name in the pandas.DataFrame instance being analyzed.

apply_reference_bias()

Compute the average value of the reference group for each target column and subtract it from the dataframe being analyzed. If no reference group is present, no action is performed.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_boxplot_plottable_one_case(*args, **kwargs)

Compute the list of enb.plotdata.PlottableData elements for a single render mode.

compute_hbar_plottable_one_case(*args, **kwargs)

Compute the list of enb.plotdata.PlottableData elements for a single render mode.

compute_histogram_plottable_one_case(*args, **kwargs)

Compute the list of enb.plotdata.PlottableData elements for a single render mode.

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function that computes a list of enb.plotdata.PlottableData elements for this case (group, column, render_mode).

See enb.aanalysis.AnalyzerSummary.compute_plottable_data_one_case for additional information.

numeric_series_to_stat_dict(series: Series, group_label: str = None)

Convert a series of numeric data into a dictionary of stats (‘avg’, ‘min’, ‘max’, ‘std’, ‘count’).

Parameters:

series – series of numeric scalar data to be analyzed. Infinite and nan values are removed before processing.

Returns:

a dictionary of stats for series.

set_scalar_description(*args, **kwargs)

Set basic descriptive statistics for the target column

class enb.aanalysis.TwoNumericAnalyzer(csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: Analyzer

Analyze pairs of columns containing scalar, numeric values. Compute basic statistics and produce a scatter plot based on the obtained data.

As opposed to ScalarNumericAnalyzer, target_columns should be an iterable of tuples with 2 column names (other elements are ignored). When applicable, the first column in each tuple is considered the x column, and the second the y column.
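A minimal sketch with hypothetical column names and data, showing the 2-element tuple format of target_columns:

import pandas as pd

import enb

# Hypothetical rate-distortion data: x column first, y column second.
results_df = pd.DataFrame({
    "bpp": [0.5, 1.0, 2.0, 4.0],
    "psnr": [30.0, 34.5, 38.2, 41.0]})

analyzer = enb.aanalysis.TwoNumericAnalyzer()
analysis_df = analyzer.get_df(
    full_df=results_df,
    target_columns=[("bpp", "psnr")],
    selected_render_modes={"scatter"},
    output_plot_dir="./plots")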

average_identical_x = False
build_summary_atable(full_df, target_columns, reference_group, group_by, include_all_group, **render_kwargs)

Build a enb.aanalysis.AnalyzerSummary instance with the appropriate columns to perform the intended analysis. See enb.aanalysis.AnalyzerSummary for documentation on the meaning of each argument.

Parameters:
  • full_df – dataframe instance being analyzed

  • target_columns – list of columns specified for analysis

  • reference_group – if not None, the column or group to be used as baseline in the analysis

  • include_all_group – force inclusion of an “All” group with all samples

  • render_kwargs – additional keyword arguments passed to get_df for adjusting the rendering process. They can be used, but are not typically needed, for implementing the summary table’s methods. See enb.aanalysis.get_df for details.

Returns:

the built summary table, without having called its get_df method.

column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

combine_groups = True
common_group_scale = True
main_alpha = 0.5
main_line_width = 2
main_marker_size = 4
plot_title = None
save_analysis_tables(group_by, reference_group, selected_render_modes, summary_df, summary_table, target_columns, column_to_properties, **render_kwargs)

Save csv and tex files into enb.config.options.analysis_dir that summarize the results of one target_columns element. If enb.config.options.analysis_dir is None or empty, no analysis is performed.

By default, the CSV contains the min, max, avg, std, and median of each group. Subclasses may overwrite this behavior.

secondary_alpha = 0.3
secondary_line_width = 1
selected_render_modes = {'line', 'scatter'}
semilog_y_min_bound = 1e-05
show_count = False
show_global = False
show_individual_samples = True
show_legend = True
show_linear_regression = False
show_x_std = True
show_y_std = True
title_y = None
update_render_kwargs_one_case(column_selection, reference_group, render_mode, summary_df, output_plot_dir, group_by, column_to_properties, show_global, show_count, **column_kwargs)

Update column_kwargs with the desired rendering arguments for this column and render mode. Return the updated dict.

valid_render_modes = {'line', 'scatter'}
class enb.aanalysis.TwoNumericSummary(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Bases: ScalarNumericSummary

Summary table used in TwoNumericAnalyzer.

For this class, target_columns must be a list of tuples, each containing two column names, corresponding to x and y, respectively. Scalar analysis is provided on each column individually, as well as basic correlation metrics for each pair of columns.

__init__(analyzer, full_df, target_columns, reference_group, group_by, include_all_group)

Dynamically generate the needed analysis columns and any other needed attributes for the analysis.

Columns that generate plottable data are automatically defined using self.render_target, based on the analyzer’s selected render modes.

Plot rendering columns are added automatically via this call, with self.render_target as the associated function and the column_selection and render_mode parameters partialed in.

Subclasses are encouraged to call self.move_render_columns_back() to make sure rendering columns are processed after any other intermediate column defined by the subclass.

Parameters:
  • analyzer – enb.aanalysis.Analyzer subclass instance corresponding to this summary table.

  • full_df – full dataframe specified for analysis.

  • target_columns – columns for which an analysis is being requested.

  • reference_group – if not None, it must be the name of one group, which is used as baseline. Different subclasses may implement this in different ways.

  • group_by – grouping configuration for this summary. See the specific subclass help for more information.

  • include_all_group – if True, an “All” group with all input samples is included in the analysis.

add_twoscalar_description_columns(column_names)

Add columns that compute several statistics considering two columns jointly, e.g., their correlation.

apply_reference_bias()

If a reference group is selected, subtract the reference_group’s average values from all groups, so that reference_group becomes the baseline.

The scatter and line render modes cannot be simultaneously selected if a reference group is selected, since the bias is applied differently for each mode.

apply_reference_bias_line()

Make the reference group the baseline (if one is selected). For each x value, subtract the reference group’s average value of each y column.

All groups must share their x values, i.e., must be x-aligned, or a ValueError is raised.

Each y-column can only be used with one x-column, or a ValueError is raised.

apply_reference_bias_scatter()

Subtract the reference group’s average value of all requested x and y columns.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_plottable_data_one_case(*args, **kwargs)

Column-setting function that computes a list of enb.plotdata.PlottableData elements for this case (group, column, render_mode).

See enb.aanalysis.AnalyzerSummary.compute_plottable_data_one_case for additional information.

set_twoscalar_description(*args, **kwargs)

Set basic descriptive statistics for the target column

enb.aanalysis.columnname_to_labels(column_name)

Guess x_label and y_label from a column name. If _to_ is found exactly once in the string, x_label is obtained from the text to its left and y_label from the text to its right. Otherwise, x_label is set using the complete column_name string, and y_label is None.
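For illustration, assuming the function returns an (x_label, y_label) pair as its description suggests (the exact label formatting may differ):

import enb

# "_to_" appears once: x label derived from "bpp", y label derived from "psnr".
x_label, y_label = enb.aanalysis.columnname_to_labels("bpp_to_psnr")

# No "_to_": x label derived from the full name, y label is None.
x_label, y_label = enb.aanalysis.columnname_to_labels("compression_time_seconds")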

enb.aanalysis.get_groupby_str(group_by)

Return a string identifying the group_by method. If None, ‘None’ is returned. If a string is passed, it is returned. If a list of strings is passed, they are joined using two underscores as separator. If grouping by family was requested, ‘family_label’ is returned.

enb.aanalysis.is_family_grouping(group_by)

Return True if and only if group_by is an iterable of one or more enb.experiment.TaskFamily instances.

enb.atable module

enb.atable: Automatic tables with implicit column definition

enb.atable.ATable produces pandas.DataFrame instances

This module defines the enb.atable.ATable class, which is the base for all automatic tables in enb.

All enb.atable.ATable subclasses generate a pandas.DataFrame instance when their get_df method is successfully called. These are powerful dynamic tables that can be used directly, and/or passed to some of the tools in the enb.aanalysis module to easily produce figures and tables.

enb.atable.ATable provides automatic persistence

The produced tables are automatically stored into persistent disk storage in CSV format. This offers several key advantages:

  • It avoids recalculating already known values. This speeds up subsequent calls to get_df for the same inputs.

  • It allows sharing your raw results in a convenient way.

  • It can help you reuse results from different projects.

This is best supported for numeric, string, and boolean types, which are assumed by default. You can also use non-scalar types, e.g., list, tuple and dict types, by setting the has_iterable_values and has_dict_values arguments of the enb.atable.ColumnProperties constructor (more on that later).

Finally, you can use any python object that can be pickled and unpickled. For this to work for a given column, has_object_values needs to be set to True in the aforementioned constructor.

The only restriction is not to use None or any other value detected as null by pandas, because these are used to efficiently signal the absence of data.

Using existing enb.atable.ATable columns

enb implements several enb.atable.ATable subclasses that can be directly used in your code. All enb.atable.ATable subclasses work as follows:

  1. They accept an iterable (e.g., a list) of indices as an input. An index is often a string, e.g., a path to an element of your test dataset. Note that enb is capable of creating that list of indices if you point it to your dataset folder.

  2. For each row, the set of defined data columns (e.g., the dependent/independent variables of an experiment) is computed and stored to disk along with the row’s index. You can reuse existing ATable subclasses directly and/or create new subclasses.

Consider the following toy example:

import enb

class TableA(enb.atable.ATable):
    def column_index_length(self, index, row):
        return len(index)

Our TableA class accepts list-like values (e.g., strings) as indices, and defines the index_length column as the number of elements (e.g., characters) in the index.

One can then use the get_df method to obtain a pandas.DataFrame instance as follows:

table_a = TableA(index="my_index_name")
example_indices = ["ab c" * i for i in range(10)]  # List of iterables
df = table_a.get_df(target_indices=example_indices)
print(df.head())

The previous code should produce the following output (automatic timestamping columns not shown):

                          my_index_name index_length
__atable_index
('',)                                              0
('ab c',)                          ab c            4
('ab cab c',)                  ab cab c            8
('ab cab cab c',)          ab cab cab c           12
('ab cab cab cab c',)  ab cab cab cab c           16

Note that the __atable_index is the dataframe’s index, which is set and used by ATable subclasses internally. This internal index is not included in the persistence data (i.e., it is not part of the CSV tables output to disk). Notwithstanding, the column values needed to build back this index are stored in the CSV.
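Persistence (described above) can be enabled for TableA by providing a csv_support_path; the file name below is hypothetical:

table_a = TableA(index="my_index_name",
                 csv_support_path="table_a_persistence.csv")
df = table_a.get_df(target_indices=example_indices)
# A second call with the same indices reuses the values stored in the CSV
# instead of recomputing them.
df = table_a.get_df(target_indices=example_indices)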

New columns: defining and composing enb.atable.ATable subclasses

enb defines many columns in its core and plugin classes. If you need more, you can easily create new enb.atable.ATable subclasses with custom columns, as explained next.

You can use string, number and boolean types for scalar columns, and dict-like and list-like (mappings and iterables) for non-scalar columns.

Basic column definition

The fastest way of defining a column is to subclass enb.atable.ATable and to create methods with names that start with column_. The value returned by these methods is automatically stored in the appropriate cell of the dataframe.

An example of this approach is copied from TableA above:

import enb

class TableA(enb.atable.ATable):
    def column_index_length(self, index, row):
        return len(index)

which defines the index_length column in that table.

Advanced column definition

To further customize your new columns, you can use the enb.atable.column_function() decorator.

  1. You can add column metainformation on how enb.aanalysis plots the data by default, e.g., labels, ranges, logarithmic axes, etc. An example column with descriptive label can be defined as follows:

    @enb.atable.column_function("uppercase", label="Uppercase version of the index")
    def set_character_sum(self, index, row):
        row["uppercase"] = index.upper()
    

    See the enb.atable.ColumnProperties class for all available plotting cues.

  2. You can set two or more columns with a single function. To do so, you can pass a list of enb.atable.ColumnProperties instances to the enb.atable.column_function() decorator. Each instance describes one column, which can be independently customized.

  3. You can define columns to contain non-scalar data. The following default types are supported: tuples, lists, dicts. Note that using non-scalar data is generally slower than using scalar types, but allows easy aggregation and combination of variables.

  4. You can mix strings and enb.atable.ColumnProperties instances in the enb.atable.column_function() decorator.

The following snippet illustrates points 2 onwards:

class TableB(TableA):
    @enb.atable.column_function("uppercase", label="Uppercase version of the index")
    def set_character_sum(self, index, row):
        row["uppercase"] = index.upper()

    @enb.atable.column_function(
        enb.atable.ColumnProperties(
            "first_and_last",
            label="First and last characters of the index",
            has_dict_values=True),

        "constant_zero",

        enb.atable.ColumnProperties(
            "space_count",
            label="Number of spaces in the string",
            plot_min=0),

    )
    def function_for_two_columns(self, index, row):
        row["first_and_last"] = {"first": index[0] if index else "",
                                 "last": index[-1] if index else ""}
        row["constant_zero"] = 0
        row["space_count"] = sum(1 for c in index if c == " ")

After the definition, the table’s dataframe can be obtained with print(TableB().get_df(target_indices=example_indices).head()) to obtain something similar to:

                                  file_path index_length         uppercase               first_and_last space_count constant_zero
__atable_index
('',)                                              0                      {'first': '', 'last': ''}           0             0
('ab c',)                          ab c            4              AB C  {'first': 'a', 'last': 'c'}           1             0
('ab cab c',)                  ab cab c            8          AB CAB C  {'first': 'a', 'last': 'c'}           2             0
('ab cab cab c',)          ab cab cab c           12      AB CAB CAB C  {'first': 'a', 'last': 'c'}           3             0
('ab cab cab cab c',)  ab cab cab cab c           16  AB CAB CAB CAB C  {'first': 'a', 'last': 'c'}           4             0
class enb.atable.ATable(index='index', csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: object

Automatic table with implicit column definition.

ATable subclasses have the get_df method, which returns a pandas.DataFrame instance with the requested data. You can use (multiple) inheritance using one or more ATable subclasses to combine the columns of those subclasses into the newly defined one. You can then define methods with names that begin with column_, or using the @enb.atable.column_function decorator on them.

See enb.atable for more detailed help and examples.

__init__(index='index', csv_support_path=None, column_to_properties=None, progress_report_period=None)
Parameters:
  • index – string with column name or list of column names that will be used for indexing. Indices provided to self.get_df must be either one instance (when a single column name is given) or a list of as many instances as elements are contained in self.index. See self.indices.

  • csv_support_path – path to a file where this ATable contents are to be stored and retrieved. If None, persistence is disabled.

  • column_to_properties – if not None, it is a mapping from strings to callables that defines the columns of the table and how to obtain the cell values

  • progress_report_period – if not None, it must be a positive number of seconds that are waited between progress report messages (if applicable).

static add_column_function(target_class, fun, column_properties)

Main entry point for column definition in enb.atable.ATable subclasses.

Methods decorated with enb.atable.column_function(), or with a name beginning with column_ are automatically “registered” using this function. It can be invoked directly to add columns manually, although it is not recommended in most scenarios.

Parameters:
  • target_class – the enb.atable.ATable subclass to which a new column is to be added.

  • column_properties – an enb.atable.ColumnProperties instance describing the column to be created. This list may also contain strings, which are interpreted as column names, creating the corresponding columns.

  • fun – column-setting function. It must have a signature compatible with a call (self, index, row), where index is the row’s index and row is a dict-like object where the new column is to be stored. Previously set columns can also be read from row. When a column-setting method is decorated, fun is automatically set so that the decorated method is called, but it is not guaranteed that fun is the decorated method.
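
As noted above, add_column_function can also be invoked directly. A rough sketch of manual registration follows (the column and function names are illustrative; the column_ prefix or the @enb.atable.column_function decorator remains the recommended approach):

def set_index_reversed(self, index, row):
    row["index_reversed"] = index[::-1]

enb.atable.ATable.add_column_function(
    target_class=TableA,
    fun=set_index_reversed,
    column_properties=enb.atable.ColumnProperties("index_reversed"))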

assert_df_sanity(df)

Perform a sanity check on the df, assuming it was produced by enb (e.g., via get_df or load_saved_df).

Raises:

CorruptedTableError – if the sanity check is not passed.

classmethod build_column_function_wrapper(fun, column_properties)

Build the wrapper function applied to all column-setting functions given a column properties instance.

enb.atable.ATable’s implementation of build_column_function_wrapper adds two variables to the column-setting function’s scope: _column_name and _column_properties, in addition to verifying the column-setting function’s signature.

Notwithstanding, this behavior can be altered in enb.atable.ATable subclasses, affecting only the wrappers for that class’ column-setting functions.

Parameters:
  • fun – function to be called by the wrapper.

  • column_properties – enb.atable.ColumnProperties instance with properties associated to the column.

Returns:

a function that wraps fun adding _column_name and _column_properties to its scope.

classmethod column_function(*column_properties, **kwargs)

Decorator for functions that produce values for column_name when given the current index and current column values.

Decorated functions are expected to have signature (atable, index, row), where atable is an ATable instance, index is a tuple of index values (corresponding to self.index), and row is a dict-like instance to be filled in by f.

Columns are sorted by the order in which they are defined, i.e., when a function is decorated for the corresponding column. Redefinitions are not allowed.

A variable _column is added to the decorated function’s scope, e.g., to assign values to the intended column of the row object.

Parameters:

column_properties

a list of one or more of the following types of elements:

  • a string with the column’s name to be used in the table. A new enb.atable.ColumnProperties instance is then created, passing **kwargs to the initializer.

  • a ColumnProperties instance. In this case **kwargs is ignored.

column_to_properties = {}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_one_row(filtered_df, index, loc, column_fun_tuples, overwrite)

Process a single row of an ATable instance, returning a Series object corresponding to that row. If an error is detected, an exception is returned instead of a Series. Note that the exception is not raised here, but intended to be detected by the compute_target_rows(), i.e., the dispatcher function.

Parameters:
  • filtered_df – pandas.DataFrame retrieved from persistent storage, with index compatible with loc. The loc argument itself need not be present in filtered_df, but is used to avoid recomputing in case overwrite is not True and columns had been set.

  • index – index value or values corresponding to the row to be processed.

  • loc – location compatible with .loc of filtered_df (although it might not be present there), and that will be set into the full loaded_df also using its .loc accessor.

  • column_fun_tuples – a list of (column, fun) tuples, where fun is to be invoked to fill column

  • overwrite – if True, existing values are overwritten with newly computed data

Returns:

a pandas.Series instance corresponding to this row, with a column named as given by self.private_index_column set to the loc argument passed to this function.

compute_target_rows(loaded_df, target_df, target_indices, target_columns, overwrite, progress_tracker=None)

Generate and return a pandas.DataFrame with as many rows as given by target_indices, with the columns given in target_columns, using this table’s column-setting functions.

This method is run when there are one or more known missing values in the requested df, e.g., there are:

  • missing columns of existing rows, and/or

  • new rows to be added (i.e., target_locs contains at least one index not in loaded_df).

The calling function must choose target_indices to be the list of needed indices (not locs).

Note that this method does not modify loaded_df.

Parameters:
  • loaded_df – the full loaded dataframe read from persistence (or created anew). It is used to avoid recomputation of existing columns, but it is not modified by this method.

  • target_df – a dataframe with identical column structure to loaded_df, with the subset of loaded_df’s rows that match the target indices (i.e., inner join)

  • target_indices – list of indices to be filled

  • target_columns – list of columns that are to be computed. enb defaults to calling this function with all columns in the table.

  • overwrite – if True, all cell values are computed even when storage data was present. In that case, the newly computed results replace the old ones.

  • progress_tracker – if not None, an enb.progress.ProgressTracker instance currently being used to keep track of this ATable’s get_df call.

Returns:

a pandas.DataFrame instance with the same column structure as loaded_df (i.e., following this class’ column definition). Each row corresponds to one element in target_indices, maintaining the same order.

Raises:

ColumnFailedError – if any of the column-setting functions crashes or fails to set a value to their assigned table cell.

dataset_files_extension = ''

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

get_all_input_indices(ext=None, base_dataset_dir=None)

Get a list of all input indices (recursively) contained in base_dataset_dir. By default, the global function enb.atable.get_all_input_files is called.

get_df(target_indices=None, target_columns=None, fill=None, overwrite=None, chunk_size=None, progress_tracker=None)

Core method for all enb.atable.ATable subclasses to obtain the table’s content. The following principles guide the way get_df works:

  • This method returns a pandas.DataFrame containing one row per element in target_indices, and as many columns as there are defined in self.column_to_properties. If target_indices is None, all files in enb.config.options.base_dataset_dir are used (after filtering by self.dataset_files_extension) by default.

  • Any persistence data already present is loaded, and only new indices or columns are added. This way, each column-setting function needs to be called only once per index for any given enb.atable.ATable subclass.

  • Results are returned only for target_indices, even if you previously computed other rows. Thus, only not-already-present indices and new columns require actual computation. Any new result produced by this call is appended to the already existing persistence data.

  • Rows computed in a previous call to this get_df are not deleted from persistent data storage, even if target_indices contains fewer or different indices than in previous calls.

  • Beware that if you remove a column definition from this enb.atable.ATable subclass and run get_df, that column will be removed from persistent storage. If you add a new column, its values will be computed for all rows in target_indices.

  • You can safely select new and/or different target_indices. New data are stored, and existent rows are not removed. If you add new column definitions, those are computed for target_indices only. If there are other previously existing rows, they are flagged as incomplete, and those new columns will be computed only when those rows’ indices are included in target_indices.

Recall that table cell values are restricted to be numeric, string, boolean or non-scalar, i.e., list, tuple or dict.

Parameters:
  • target_indices – list of indices that are to be contained in the table, or None to infer automatically from the dataset.

  • target_columns – if not None, it must be a list of column names (defined for this class) that are to be obtained for the specified indices. If None, all columns are used.

  • fill – If True or False, it determines whether values are computed for the selected indices. If None, values are only computed if enb.config.options.no_new_results is False.

  • overwrite – if True, values selected for filling are computed even if they are present in permanent storage. Otherwise, existing values are not recomputed.

  • chunk_size – If None, its value is assigned from options.chunk_size. After this, if not None, the list of target indices is split in chunks of size at most chunk_size elements (each one corresponding to one row in the table). Results are made persistent every time one of these chunks is completed. Setting chunk_size to -1 is functionally identical to setting it to None (or to the number of target indices), but it does not display “Starting chunk 1/1…” (useful if the chunk partitioning is performed outside, i.e., by an Experiment class).

  • progress_tracker – if not None, the enb.progress.ProgressTracker instance being used to keep track of an ATable instance at an upper level.

Returns:

a DataFrame instance containing the requested data

Raises:

CorruptedTableError, ColumnFailedError, when an error is encountered processing the data.
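
Following the TableA example above, a minimal sketch of requesting a subset of columns with chunked persistence (the selected column and chunk size are illustrative):

df = table_a.get_df(
    target_indices=example_indices,
    target_columns=["index_length"],  # compute/retrieve only this column
    chunk_size=2)                     # persist results every 2 completed rows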

get_df_one_chunk(target_indices, target_columns, fill_needed, overwrite, run_sanity_checks, progress_tracker=None)

Internal implementation of the get_df() functionality, to be applied to a single chunk of indices. It is essentially a self-contained call to enb.atable.ATable.get_df as described in its documentation, where data are stored in memory until all computations are done, and then the persistent storage is updated if needed.

Parameters:
  • target_indices – list of indices for this chunk

  • target_columns – list of column names to be filled in this call

  • fill_needed – if False, results are not computed (they default to None). Instead, only data in persistent storage is used.

  • overwrite – if True, values selected for filling are computed even if they are present in permanent storage. Otherwise, existing values are not recomputed.

  • run_sanity_checks – if True, sanity checks are performed on the data

  • progress_tracker – the ProgressTracker instance being used to track progress of this call, None if none is currently being used, or False if no progress tracker is to be employed.

Returns:

a DataFrame instance containing the requested data

Raises:

ColumnFailedError – an error was encountered while computing the data.

get_matlab_struct_str(target_indices)

Return a string containing MATLAB code that defines a list of structs (one per target index), with the fields being the columns defined for this table.

ignored_columns = []

Column names in this list are not retrieved nor saved to persistence, even if they are defined.

property indices

If self.index is a string, it returns a list with that column name. If self.index is a list, it returns self.index. Useful to iterate homogeneously regardless of whether single or multiple indices are used.

property indices_and_columns
Returns:

a list of all defined columns, i.e., those for which a function has been defined.

load_saved_df(csv_support_path=None, run_sanity_checks=True)

Load the df stored in permanent support (if any) and return it. If not present, an empty dataset is returned instead.

Parameters:

run_sanity_checks – if True, data are verified to detect corruption. This may increase computation time, but provides an extra layer of data reliability.

Returns:

the loaded table_df, which may be empty

Raises:

CorruptedTableError – if run_sanity_checks is True and a problem is detected.

property name

Return the name of the table. Defaults to the table class name.

static normalize_column_function_arguments(column_property_list, fun, **kwargs)

Helper method to verify and normalize the column_property_list varargs passed to add_column_function. Each element of that list is passed as the column_properties argument to this function.

  • If the element is a string, it is interpreted as a column name, and a new ColumnProperties object is created with that name and the fun argument to this function. The kwargs argument is passed to that initializer.

  • If the element is a enb.atable.ColumnProperties instance, it is returned without modification. The kwargs argument is ignored in this case.

  • Otherwise, a SyntaxError is raised.

Parameters:
  • column_property_list – one of the elements of the *column_property_list parameter to add_column_function.

  • fun – the function being decorated.

Returns:

a nonempty list of valid ColumnProperties instances

private_index_column = '__atable_index'

Name of the index used internally.

classmethod redefines_column(fun)

Decorator to be applied to overriding methods that are meant to fill the same columns as the base class’s method of the same name.
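
A rough sketch of its intended use, overriding the column_index_length method defined in the TableA example above (exact usage details may vary):

class TableD(TableA):
    @enb.atable.redefines_column
    def column_index_length(self, index, row):
        # Count only non-space characters instead of all characters.
        return sum(1 for c in index if c != " ")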

write_persistence(df, output_csv=None)

Dump a dataframe produced by this table into persistent storage.

Parameters:

output_csv – if None, self.csv_support_path is used as the output path.

exception enb.atable.ColumnFailedError(atable=None, index=None, column=None, msg=None, ex=None, exception_list=None)

Bases: CorruptedTableError

Raised when a function failed to fill a column.

__init__(atable=None, index=None, column=None, msg=None, ex=None, exception_list=None)
Parameters:
  • atable – atable instance that originated the problem

  • column – column where the problem happened

  • ex – main exception that lead to the problem, or None

  • exception_list – a list of exceptions related to this one, e.g., all failing columns

  • msg – message describing the problem, or None

class enb.atable.ColumnProperties(name, fun=None, label=None, plot_min=None, plot_max=None, semilog_x=False, semilog_y=False, semilog_x_base=10, semilog_y_base=10, hist_label=None, hist_min=None, hist_max=None, hist_bin_width=None, has_dict_values=False, has_iterable_values=False, has_object_values=False, hist_label_dict=None, **extra_attributes)

Bases: object

All columns defined in an enb.atable.ATable subclass have a corresponding enb.atable.ColumnProperties instance, which provides metainformation about it. Its main uses are providing plotting cues and allowing non-scalar data (tuples, lists and dicts). Once an enb.atable.ATable subclass c is defined, c.column_to_properties contains a mapping from a column’s name to its ColumnProperties instance. It is possible to change attributes of column properties instances, and to replace the ColumnProperties instances in column_to_properties. For instance, one may want to plot a column with its original cues first, and then create a second version with semi-logarithmic axes. Then it would suffice to use enb.aanalysis tools with the enb.atable.ATable subclass’ default column_to_properties first, then modify one or more ColumnProperties instances, and finally apply the same tools again.
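
A minimal sketch of that two-pass workflow, using the space_count column from the TableB example above:

# First pass: plot with TableB.column_to_properties unchanged.
# Then adjust the cues of one column before a second pass:
props = TableB.column_to_properties["space_count"]
props.semilog_y = True                        # request a semi-logarithmic Y axis
props.label = "Number of spaces (log scale)"
# ... and apply the same enb.aanalysis tools again with the updated mapping.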

__init__(name, fun=None, label=None, plot_min=None, plot_max=None, semilog_x=False, semilog_y=False, semilog_x_base=10, semilog_y_base=10, hist_label=None, hist_min=None, hist_max=None, hist_bin_width=None, has_dict_values=False, has_iterable_values=False, has_object_values=False, hist_label_dict=None, **extra_attributes)

Column-function linking:

Parameters:
  • name – unique name that identifies a column.

  • fun – function to be invoked to fill a column value. If None, enb will set this for you when you define columns with column_ or enb.atable.column_function().

Type specification (mutually exclusive).

Parameters:
  • has_dict_values – set to True if and only if the column cells contain value mappings (i.e., dicts), as opposed to scalar values. Both keys and values should be valid scalar values (numeric, string or boolean). It cannot be True if any other type is specified.

  • has_iterable_values – set to True if and only if the column cells contain iterables, i.e., tuples or lists. It cannot be True if any other type is specified.

  • has_object_values – set to True if and only if the column cells contain general python objects that can be pickled and unpickled.

Note

The has_ast_values property of the ColumnProperties instance will return true if and only if iterable or dict values are used.

Plot rendering hints:

Parameters:
  • label – descriptive label of the column, intended to be displayed in plot (e.g., axes) labels

  • plot_min – minimum value to be plotted for the column. For histograms, this refers to the range of key (X-axis) values.

  • plot_max – maximum value to be plotted for the column. For histograms, this refers to the range of key (X-axis) values.

  • semilog_x – True if a log scale should be used in the X axis.

  • semilog_y – True if a log scale should be used in the Y axis.

  • semilog_x_base – log base to use if semilog_x is true.

  • semilog_y_base – log base to use if semilog_y is true.

Parameters specific to histograms, only applicable when has_dict_values is True.

Parameters:
  • hist_bin_width – histogram bin used when calculating distributions

  • hist_label_dict – None, or a dictionary with x-value to label dict

  • secondary_label – secondary label for the column, i.e., the Y axis of an histogram column.

  • hist_min – if not None, the minimum value to be plotted in histograms. If None, the Analyzer instance decides the range (typically (0,1)).

  • hist_max – if not None, the maximum value to be plotted in histograms. If None, the Analyzer instance decides the range (typically (0,1)).

  • hist_label – if not None, the label to be shown globally in the Y axis.

User-defined attributes:

Parameters:

extra_attributes – any parameters passed are set as attributes of the created instance (with __setattr__). These attributes are not directly used by enb’s core, but can be safely used by client code.

property has_ast_values

Determine whether this column requires ast for literal parsing, e.g., for supported non-scalar data: dicts and iterables.

exception enb.atable.CorruptedTableError(atable, ex=None, msg=None)

Bases: Exception

Raised when a table is corrupted, e.g., when loading a CSV with missing indices.

__init__(atable, ex=None, msg=None)
Parameters:

msg – message describing the error that took place

class enb.atable.MetaTable(name, bases, dct)

Bases: type

Metaclass for enb.atable.ATable and all subclasses, which guarantees that column_to_properties is a static OrderedDict instance different from other classes’ column_to_properties. This way, enb.atable.ATable and all subclasses can access and update their dicts separately for each class, effectively allowing column definitions to be split across multiple enb.atable.ATable subclasses.

Note: Table classes should inherit from enb.atable.ATable, not enb.atable.MetaTable. You probably don’t ever need to use this class directly.

automatic_column_function_prefix = 'column_'
static get_auto_column_wrapper(fun)

Create a wrapper for a function with a signature compatible with column-setting functions, so that its returned value is assigned to the row’s column.

pendingdefs_classname_fun_columnpropertylist = []
static set_column_mro(subclass)

Redefine column properties that have been defined in the child class, making sure the expected method resolution order (MRO) is maintained.

This method prevents the parent’s method from being invoked to fill in the pandas.DataFrame, following the intuitive OOP behavior.

class enb.atable.SummaryTable(full_df, column_to_properties=None, copy_df=False, csv_support_path=None, group_by=None, include_all_group=False)

Bases: ATable

Summary tables allow defining custom groups of rows of dataframes (e.g., produced by ATable subclasses), and defining new columns (measurements) for each of those groups.

Column functions can be defined in the same way as for any ATable. In this case, the index elements passed to the column functions are the group labels returned by split_groups(). Column functions can then access the corresponding dataframe with self.label_to_df[label].

Note that this behaviour is not unlike the groupby() method of pandas. The main differences are:

  • Grouping can be fully customized, instead of only allowing splitting by one or more column values

  • The newly defined columns can aggregate data in the group in any arbitrary way. This is of course true for pandas, but SummaryTable tries to gracefully integrate that process into enb, allowing automatic persistence, easy plotting, etc.

SummaryTable can be particularly useful as an intermediate step between a complex table’s (or enb.Experiment’s) get_df and the analyze_df method of analyzers in enb.aanalysis.
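
A rough sketch of a SummaryTable subclass (class and column names are illustrative) that adds one aggregate column per group, using the TableB example from above:

full_df = TableB().get_df(target_indices=example_indices)

class LengthSummary(enb.atable.SummaryTable):
    def column_mean_index_length(self, index, row):
        # index is the group label; the group's rows are available via label_to_df.
        return self.label_to_df[index]["index_length"].mean()

summary = LengthSummary(full_df=full_df, group_by="constant_zero", include_all_group=True)
print(summary.get_df())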

__init__(full_df, column_to_properties=None, copy_df=False, csv_support_path=None, group_by=None, include_all_group=False)
Initialize a summary table.

Group splitting is not invoked until needed by calling self.get_df().

Column-setting functions are given the group label and the row to be completed. They can access self.label_to_df to get the dataframe corresponding to the row’s group.

Parameters:
  • full_df – reference pandas dataframe to be summarized.

  • column_to_properties – if not None, it should be the column_to_properties attribute of the table that produced reference_df.

  • copy_df – if not True, a pointer to the original reference_df is used. Otherwise, a copy is made. Note that reference_df is typically evaluated each time split_groups() is called.

  • csv_support_path – if not None, a CSV file at that path is used for persistence.

  • include_all_group – if True, a group “All” with all samples is included in the summary.

column_group_label(index, row)

Set the name of the group in a column.

column_group_size(index, row)

Number of elements (rows from full_df) in the group.

column_to_properties = {'group_label': ColumnProperties('name'='group_label', 'fun'=<function SummaryTable.column_group_label>, 'label'='Group label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'group_size': ColumnProperties('name'='group_size', 'fun'=<function SummaryTable.column_group_size>, 'label'='Group size', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

get_df(*args, reference_df=None, include_all_group=None, **kwargs)

Get the summary dataframe. This class defines only the ‘group_label’ and ‘group_size’ columns for the output dataframe. Subclasses may add as many columns to the summary as desired.

Parameters:

reference_df – if not None, the dataframe to be used as reference for the summary. If None, the one provided at initialization is used.

Returns:

the summary dataframe with all columns defined for self’s class.

split_groups(reference_df=None, include_all_group=None)

Split the reference_df pandas.DataFrame into an iterable of (label, dataframe) tuples. This splitting is performed based on the value of self.group_by:

  • If it is None, a single group labelled “all” is created, associated to reference_df.

  • If it is not None:
    • It can be a pandas.DataFrame column index, e.g., a column name or a list of column names. In this case, the result of pandas’ groupby is returned.

    • It can be a callable with a single argument reference_df. In this case, the result of calling that method with reference_df as argument is returned by the call to split_groups().

Subclasses can easily implement custom grouping methods, which must adhere to the following constraints: they must return an iterable of (group_label, group_df) tuples, and the returned group_label values must be unique.

Also note that:

  • It is NOT needed that the union of all group_df tuples yield reference_df.

  • It is NOT needed that the intersection of any two group_df elements is empty.

  • The group_df dataframes normally contain all columns in reference_df, but it is NOT mandatory to maintain this behavior.

Parameters:
  • reference_df – if not None, a reference dataframe to split. If None, self.reference_df is employed instead.

  • include_all_group – if True, an “All” group is added, containing all input samples, regardless of the groups produced based on groupby. If None, self’s class is queried for that attribute.

Returns:

an iterable of (label, dataframe) tuples.
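
When group_by is a callable, a sketch of a compatible grouping function (names are illustrative), reusing the full_df dataframe from the LengthSummary sketch above:

def group_by_length_parity(reference_df):
    # Must yield (group_label, group_df) tuples with unique labels.
    yield "even length", reference_df[reference_df["index_length"] % 2 == 0]
    yield "odd length", reference_df[reference_df["index_length"] % 2 == 1]

summary = enb.atable.SummaryTable(full_df=full_df, group_by=group_by_length_parity)
print(summary.get_df())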

enb.atable.check_unique_indices(df)

Verify that df has no duplicated indices, or raise a CorruptedTableError.

enb.atable.clean_column_name(column_name)

Return a cleaned version of the column name, more indicated for display.

enb.atable.column_function(*column_property_list, **kwargs)

New columns can be added to enb.atable.ATable subclasses by decorating them with @enb.atable.column_function, e.g., with code similar to the following:

class TableA(enb.atable.ATable):
    @enb.atable.column_function("uppercase", label="Uppercase version of the index")
    def set_character_sum(self, index, row):
        row["uppercase"] = index.upper()

The column_property_list argument can be one of the following options:

  • one or more strings, which are interpreted as the new column(s)’ name(s). For example:

    class TableC(enb.atable.ATable):
        @enb.atable.column_function("uppercase", "lowercase")
        def set_character_sum(self, index, row):
            row["uppercase"] = index.upper()
            row["lowercase"] = index.lower()
    
  • one or more enb.atable.ColumnProperties instances, one for each defined column.

  • a list of enb.atable.ColumnProperties instances, e.g., by invoking @column_function([cp1, cp2]), where cp1 and cp2 are enb.atable.ColumnProperties instances. This option is deprecated and provided for backwards compatibility only. If l = [cp1, cp2], then @column_function(l) (deprecated) and @column_function(*l) should result in identical column definitions.

Decorator to allow definition of table columns for still undefined classes. To do so, MetaTable keeps track of enb.atable.column_function()-decorated methods while the class is being defined. Then, when the class is created, MetaTable adds the columns defined by the decorated functions.

Parameters:

column_property_list – list of column property definitions, as described above.

Returns:

the wrapper that actually decorates the function using the column_property_list and kwargs parameters.

enb.atable.get_all_input_files(ext=None, base_dataset_dir=None)

Get a list of all input files (recursively) contained in base_dataset_dir.

Parameters:
  • ext – if not None, only files with names ending with ext will be returned.

  • base_dataset_dir – if not None, the dir where test files are searched for recursively. If None, options.base_dataset_dir is used instead.

Returns:

the sorted list of canonical paths to the found input files.

enb.atable.get_canonical_path(file_path)
Returns:

the canonical version of a path to be stored in the database, to make sure indexing is consistent across code using enb.atable.ATable and its subclasses.

enb.atable.get_class_that_defined_method(meth)

From the great answer at https://stackoverflow.com/questions/3589311/get-defining-class-of-unbound-method-object-in-python-3/25959545#25959545

enb.atable.get_nonscalar_value(cell_value)

Parse a pandas.DataFrame’s cell value in a column declared to contain non-scalar types, i.e., dict, list or tuple. Return an instance of one of those types.

If cell_value is a string, ast is employed to parse it. If cell_value is a dict, list or tuple, it is returned without modification. Otherwise, an error is raised.

Note that enb.atable.ATable subclasses produce dataframes with the intended data types also for non-scalar types. This method is provided as a convenience tool for the case when raw CSV files produced by enb are read directly, and not through enb.atable.ATable’s persistence system.
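
A short sketch of that use case (the CSV path is illustrative; the column name is taken from the TableB example):

import pandas as pd

raw_df = pd.read_csv("persistence_table_b.csv")
raw_df["first_and_last"] = raw_df["first_and_last"].apply(
    enb.atable.get_nonscalar_value)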

enb.atable.indices_to_internal_loc(values)

Given an index string or list of strings, return a single index string that uniquely identifies those strings and can be used as an internal index.

This is used internally to set the actual pandas.DataFrame’s index value to a unique value that represents the row’s index. Note that pandas.DataFrame’s subindexing is intentionally not used to maintain a simple, flat structure of tables without nesting.

Returns:

a unique string for indexing given the input values

enb.atable.parallel_compute_one_row(atable_instance, filtered_df, index, loc, column_fun_tuples, overwrite)

Ray wrapper for ATable.process_row()

enb.atable.redefines_column(fun)

When an enb.atable.ATable subclass defines a method with the same name as any of the parent classes, and when that method defines a column, it must be decorated with this.

Otherwise, a SyntaxError is raised. This is to prevent hard-to-find bugs where a parent class’ definition of the method is used when filling a row’s column, but calling that method on the child’s instance runs the child’s code.

Functions decorated with this method acquire a _redefines_column attribute, that is then identified by enb.atable.ATable.add_column_function(), i.e., the method responsible for creating columns.

Note that _redefines_column attributes for non-column and non-overriding methods are not employed by enb thereafter.

Parameters:

fun – the redefining function being decorated

enb.atable.string_or_float(cell_value)

Takes the input value from an enb.atable.ATable cell and returns either its float value or its string value. In the latter case, one level of surrounding ‘ or “ is removed from the value before returning.

Returns:

the string or float value given by cell_value

enb.atable.unpack_index_value(index)

Unpack an enb-created pandas.DataFrame index and return its elements. This can be useful to iterate homogeneously regardless of whether single or multiple indices are used.

Returns:

If input is a string, it returns a list with that value. If input is a list, it is returned unchanged.

enb.experiment module

Tools to run compression experiments

class enb.experiment.Experiment(tasks, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, task_families=None)

Bases: ATable

An Experiment allows seamless execution of one or more tasks upon a corpus of test files. Tasks are identified by a file index and an ExperimentTask’s (unique) name.

For each task, any number of table columns can be defined to gather results of interest. This allows easy extension to highly complex and/or specific experiments. See set_task_name() for an example.

Automatic persistence of the obtained results is provided, to allow faster experiment development and result replication.

Internally, an Experiment consists of an ATable containing the properties of the test corpus, and is itself another ATable that contains the results of an experiment with as many user-defined columns as needed. When the get_df() method is called, a joint DataFrame instance is returned that contains both the experiment results, and the associated metainformation columns for each corpus element.
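
A rough sketch (task and column names are illustrative) of a minimal custom experiment; when dataset_paths is None, the input files are discovered automatically:

class ToyTask(enb.experiment.ExperimentTask):
    def apply(self, experiment, index, row):
        # Called once per (file, task) combination before other columns are set.
        pass

class ToyExperiment(enb.experiment.Experiment):
    def column_file_name_length(self, index, row):
        file_path, task = self.index_to_path_task(index)
        return len(file_path)

exp = ToyExperiment(tasks=[ToyTask()])
df = exp.get_df()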

__init__(tasks, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, task_families=None)
Parameters:
  • tasks – an iterable of ExperimentTask instances. Each test file is processed by all defined tasks. For each (file, task) combination, a row is included in the table returned by get_df().

  • dataset_paths – list of paths to the files to be used as input. If it is None, this list is obtained automatically calling enb.sets.get_all_test_file()

  • csv_experiment_path – if not None, path to the CSV file giving persistence support to this experiment. If None, it is automatically determined within options.persistence_dir.

  • csv_dataset_path – if not None, path to the CSV file giving persistence support to the dataset file properties. If None, it is automatically determined within options.persistence_dir.

  • dataset_info_table – if not None, it must be an enb.sets.FilePropertiesTable instance or subclass instance that can be used to obtain dataset file metainformation, and/or gather it from csv_dataset_path. If None, a new enb.sets.FilePropertiesTable instance is created and used for this purpose. This parameter can also be a class (instead of an instance). In this case, its initializer is assumed to accept a csv_support_path argument and to be compatible with the enb.sets.FilePropertiesTable interface.

  • overwrite_file_properties – if True, file properties are necessarily computed before starting the experiment. This can be useful for temporary and/or random datasets. If False, file properties are loaded from the persistent storage when available. Note that this parameter does not affect whether experiment results are retrieved from persistent storage if available. This is controlled via the parameters passed to get_df()

  • task_families – if not None, it must be a list of TaskFamily instances. It is used to set the “family_label” column for each row. If the codec is not found within the families, a default label is set indicating so.

column_to_properties = {'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

default_file_properties_table_class

alias of FilePropertiesTable

get_dataset_df()

Get the DataFrame of the employed dataset.

get_dataset_info_row(file_path)

Get the dataset info table row for the file path given as argument.

get_df(target_indices=None, target_columns=None, fill=True, overwrite=None, chunk_size=None)

Get a DataFrame with the results of the experiment. The produced DataFrame contains the columns from the dataset info table (but they are not stored in the experiment’s persistence file).

Parameters:
  • target_indices – list of file paths to be processed. If None, self.target_file_paths is used instead.

  • chunk_size – if not None, a positive integer that determines the number of table rows that are processed before made persistent.

  • overwrite – if not None, a flag determining whether existing values should be calculated again. If None, the corresponding default in enb.config.options is used.

index_to_path_task(index)

Given an Experiment’s row index, return (path, task), where path is the canonical path of the row’s dataset element, and task is the task instance corresponding to that row.

property joined_column_to_properties

Get the combined dictionary of enb.atable.ColumnProperties indexed by column name. This dictionary contains the dataset properties columns and the experiment columns.

Note that column_to_properties returns only the experiment columns.

no_family_label = 'No family'
set_family_label(index, row)

Set the label of the family to which this row’s task belongs, or set it to self.no_family_label.

set_param_dict(index, row)

Store the task’s param dict for easy reference and access to the param values.

set_task_apply_time(index, row)

Run the task.apply() method and store its process time.

set_task_label(index, row)

Set the label of the task for this row.

set_task_name(index, row)

Set the name of the task for this row.

task_apply_time_column = 'task_apply_time'
task_label_column = 'task_label'
task_name_column = 'task_name'
class enb.experiment.ExperimentTask(param_dict=None)

Bases: object

Identify an Experiment’s task and its configuration.

__init__(param_dict=None)
Parameters:

param_dict – dictionary of configuration parameters used for this task. By default, they are part of the task’s name.

apply(experiment, index, row)

This method is called for all combinations of input indices and tasks before computing any additional column.

Parameters:
  • experiment – experiment invoking this method

  • index – index being processed by the experiment. Consider experiment.index_to_path_task(index).

  • row – row being currently processed by the experiment

property label

Label to be displayed for the task in output reports. May not be strictly unique nor fully informative. By default, the task’s name is returned.

property name

Name that uniquely identifies a task and its configuration. It is not intended to be displayed as much as to be used as an index.

class enb.experiment.TaskFamily(label, task_names=None, name_to_label=None)

Bases: object

Describe a sorted list of task names that identify a family of related results within a DataFrame. Typically, this family will be constructed using task workers (e.g., enb.icompression.AbstractCodec instances) that share all configuration values except for a parameter.
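
A short sketch (reusing the illustrative ToyTask and ToyExperiment from above); task_names should match the tasks’ .name values so the “family_label” column can be set:

tasks = [ToyTask(param_dict={"level": level}) for level in (1, 2, 3)]
family = enb.experiment.TaskFamily(
    label="Toy tasks",
    task_names=[t.name for t in tasks])
exp = ToyExperiment(tasks=tasks, task_families=[family])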

__init__(label, task_names=None, name_to_label=None)
Parameters:
  • label – Printable name that identifies the family

  • task_names – if not None, it must be a list of task names ( strings) that are expected to be found in an ATable’s DataFrame when analyzing it.

  • name_to_label – if not None, it must be a dictionary indexed by task name that contains a displayable version of it.

add_task(task_name, task_label=None)

Add a new task name to the family (it becomes the last element in self.task_names)

Parameters:

task_name – a new task name, not previously included in the family

enb.fits module

FITS format manipulation tools. See https://fits.gsfc.nasa.gov/fits_documentation.html.

class enb.fits.FITSVersionTable(original_base_dir, version_base_dir)

Bases: FileVersionTable, FilePropertiesTable

Read FITS files and convert them to raw files, sorting them by type (integer or float) and by bits per pixel.
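
A minimal usage sketch (paths are illustrative):

fits_table = enb.fits.FITSVersionTable(
    original_base_dir="./datasets/fits",
    version_base_dir="./datasets/raw_from_fits")
versioned_df = fits_table.get_df()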

__init__(original_base_dir, version_base_dir)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, options.base_dataset_dir is used.

allowed_extensions = ['fit', 'fits']
column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_to_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

fits_extension = 'fit'
get_default_target_indices()

Get the list of samples in self.original_base_dir and its subdirs that have extension self.dataset_files_extension.

original_to_versioned_path(original_path)

Get the path of the versioned file corresponding to original_path. This function will replicate the folder structure within self.original_base_dir.

set_version_repetitions(file_path, row)

Set the number of times the versioning process is performed.

version(input_path, output_path, row)

Create a version of input_path and write it into output_path.

Parameters:
  • input_path – path to the file to be versioned

  • output_path – path where the version should be saved

  • row – metainformation available using super().get_df for input_path

Returns:

if not None, the time in seconds it took to perform the (forward) versioning.

version_name = 'FitsToRaw'
class enb.fits.FITSWrapperCodec(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)

Bases: WrapperCodec

Raw images are coded into FITS before compression with the wrapper, and FITS is decoded to raw after decompression.

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None.

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

enb.icompression module

Image compression experiment module.

class enb.icompression.AbstractCodec(param_dict=None)

Bases: ExperimentTask

Base class for all codecs

__init__(param_dict=None)
Parameters:

param_dict – dictionary of configuration parameters used for this task. By default, they are part of the task’s name.

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None.

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

compression_results_from_paths(original_path, compressed_path)

Get the default CompressionResults instance corresponding to the compression of original_path into compressed_path

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

decompression_results_from_paths(compressed_path, reconstructed_path)

Return a enb.icompression.DecompressionResults instance given the compressed and reconstructed paths.

property label

Label to be displayed for the codec. May not be strictly unique nor fully informative. By default, self’s class name is returned.

property name

Name of the codec. Subclasses are expected to yield different values when different parameters are used. By default, the name consists of the class name followed by all elements in self.param_dict, sorted alphabetically.

exception enb.icompression.CompressionException(original_path=None, compressed_path=None, file_info=None, status=None, output=None)

Bases: Exception

Base class for exceptions raised during a compression instance

__init__(original_path=None, compressed_path=None, file_info=None, status=None, output=None)
class enb.icompression.CompressionExperiment(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: Experiment

This class allows seamless execution of compression experiments.

In the functions decorated with @atable.column_function, the row argument contains two magic properties, compression_results and decompression_results. These give access to the CompressionResults and DecompressionResults instances resulting respectively from compressing and decompressing according to the row index parameters. The paths referenced in the compression and decompression results are valid while the row is being processed, and are disposed of afterwards. Also, the image_info_row attribute gives access to the image metainformation (e.g., geometry).
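
A rough sketch of a user-defined column relying on those magic properties (the compressed_path attribute accessed below is an assumption for illustration):

import os

class MyCompressionExperiment(enb.icompression.CompressionExperiment):
    @enb.atable.column_function("compressed_size_kb", label="Compressed size (KB)")
    def set_compressed_size_kb(self, index, row):
        # Accessing compression_results triggers compression for this row if needed.
        compressed_path = row.compression_results.compressed_path  # assumed attribute
        row["compressed_size_kb"] = os.path.getsize(compressed_path) / 1024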

class CompressionDecompressionWrapper(file_path, codec, image_info_row, reconstructed_copy_dir=None, compressed_copy_dir=None)

Bases: object

This class is instantiated for each row of the table, and added to a temporary column row_wrapper_column_name. Column-setting methods can then access this wrapper, and in particular its compression_results and decompression_results properties, which will run compression and decompression at most once. This way, many columns can be defined independently without needing to compress and decompress for each one.

__init__(file_path, codec, image_info_row, reconstructed_copy_dir=None, compressed_copy_dir=None)
Parameters:
  • file_path – path to the original image being compressed

  • codec – AbstractCodec instance to be used for compression/decompression

  • image_info_row – dict-like object with geometry and data type information about file_path

  • reconstructed_copy_dir – if not None, a copy of the reconstructed images is stored, based on the class of codec.

  • compressed_copy_dir – if not None, a copy of the compressed images is stored, based on the class of codec.

property compression_results

Perform the actual compression experiment for the selected row.

property decompression_results

Perform the actual decompression experiment for the selected row.

property numpy_dtype

Get the numpy dtype corresponding to the original image’s data format

__init__(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)
Parameters:
  • codecs – list of AbstractCodec instances. Note that codecs are compatible with the interface of ExperimentTask.

  • dataset_paths – list of paths to the files to be used as input for compression. If it is None, this list is obtained automatically from the configured base dataset dir.

  • csv_experiment_path – if not None, path to the CSV file giving persistence support to this experiment. If None, it is automatically determined within options.persistence_dir.

  • csv_dataset_path – if not None, path to the CSV file giving persistence support to the dataset file properties. If None, it is automatically determined within options.persistence_dir.

  • dataset_info_table – if not None, it must be a ImagePropertiesTable instance or subclass instance that can be used to obtain dataset file metainformation, and/or gather it from csv_dataset_path. If None, a new ImagePropertiesTable instance is created and used for this purpose.

  • overwrite_file_properties – if True, file properties are recomputed before starting the experiment. Useful for temporary and/or random datasets. Note that overwriting of the experiment results themselves is controlled via the call to get_df.

  • reconstructed_dir_path – if not None, a directory where reconstructed images are to be stored.

  • compressed_copy_dir_path – if not None, the directory where a copy of the compressed images is to be stored. Copies may not be generated for images for which all columns are already known.

  • task_families – if not None, it must be a list of TaskFamily instances. It is used to set the “family_label” column for each row. If the codec is not found within the families, a default label is set indicating so.

property codecs
Returns:

an iterable of defined codecs

property codecs_by_name

Alias for tasks_by_name

column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 
'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

property compression_results: CompressionResults

Get the current compression results from self.codec_results. This property is intended to be read from functions that set columns of a row. It triggers the compression of that row’s sample with that row’s codec if it hasn’t been compressed yet. Otherwise, None is returned.
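
For illustration, the following is a minimal sketch of how a column-setting function might read this property. The experiment subclass, the "compressed_kb" column and the use of the enb.atable.column_function decorator are illustrative assumptions, not part of this class.

    import os
    import enb.atable
    import enb.icompression

    class MyCompressionExperiment(enb.icompression.CompressionExperiment):
        # Hypothetical extra column: compressed size in kilobytes.
        @enb.atable.column_function("compressed_kb", label="Compressed size (KB)")
        def set_compressed_kb(self, index, row):
            # Reading self.compression_results triggers compression of this
            # row's sample with this row's codec if it has not run yet.
            results = self.compression_results
            row["compressed_kb"] = os.path.getsize(results.compressed_path) / 1024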

compute_one_row(filtered_df, index, loc, column_fun_tuples, overwrite)

Process a single row of an ATable instance, returning a Series object corresponding to that row. If an error is detected, an exception is returned instead of a Series. Note that the exception is not raised here, but is intended to be detected by compute_target_rows(), i.e., the dispatcher function.

Parameters:
  • filtered_df – pandas.DataFrame retrieved from persistent storage, with an index compatible with loc. The loc argument itself need not be present in filtered_df; it is used to avoid recomputation when overwrite is not True and the columns have already been set.

  • index – index value or values corresponding to the row to be processed.

  • loc – location compatible with .loc of filtered_df (although it might not be present there), and that will be set into the full loaded_df also using its .loc accessor.

  • column_fun_tuples – a list of (column, fun) tuples, where fun is to be invoked to fill column

  • overwrite – if True, existing values are overwritten with newly computed data

Returns:

a pandas.Series instance corresponding to this row, with a column named as given by self.private_index_column set to the loc argument passed to this function.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

property decompression_results: DecompressionResults

Get the current decompression results from self.codec_results. This property is intended to be read from functions that set columns of a row. It triggers the compression and decompression of that row’s sample with that row’s codec if they have not been performed yet. Otherwise, None is returned.

default_file_properties_table_class

alias of ImagePropertiesTable

row_wrapper_column_name = '_codec_wrapper'
set_bpppc(index, row)
set_comparison_results(index, row)

Perform a compression-decompression cycle and store the comparison results

set_compressed_data_size(index, row)
set_compression_ratio_dr(index, row)

Set the compression ratio calculated based on the dynamic range of the input samples, as opposed to 8*bytes_per_sample.
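
As a rough illustration of the intent (not necessarily the exact expression used by enb), the dynamic-range-based ratio can be sketched as follows:

    def compression_ratio_dr(sample_count, dynamic_range_bits, compressed_size_bytes):
        # Input size counted at dynamic_range_bits bits per sample, divided by
        # the compressed size in bits (illustrative sketch only).
        return (sample_count * dynamic_range_bits) / (8 * compressed_size_bytes)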

set_efficiency(index, row)
class enb.icompression.CompressionResults(codec_name=None, codec_param_dict=None, original_path=None, compressed_path=None, side_info_files=None, compression_time_seconds=None, maximum_memory_kb=None)

Bases: object

Base class that defines the minimal fields that are returned by a call to a codec’s compress() method (or produced by the CompressionExperiment instance).

__init__(codec_name=None, codec_param_dict=None, original_path=None, compressed_path=None, side_info_files=None, compression_time_seconds=None, maximum_memory_kb=None)
Parameters:
  • codec_name – codec’s reported_name

  • codec_param_dict – dictionary of parameters to the codec

  • original_path – path to the input original file

  • compressed_path – path to the output compressed file

  • side_info_files – list of file paths with side information

  • compression_time_seconds – effective average compression time in seconds

  • maximum_memory_kb – maximum resident memory in kilobytes

exception enb.icompression.DecompressionException(compressed_path=None, reconstructed_path=None, file_info=None, status=None, output=None)

Bases: Exception

Base class for exceptions raised during decompression

__init__(compressed_path=None, reconstructed_path=None, file_info=None, status=None, output=None)
class enb.icompression.DecompressionResults(codec_name=None, codec_param_dict=None, compressed_path=None, reconstructed_path=None, side_info_files=None, decompression_time_seconds=None, maximum_memory_kb=None)

Bases: object

Base class that defines the minimal fields that are returned by a call to a codec’s decompress() method (or produced by the CompressionExperiment instance).

__init__(codec_name=None, codec_param_dict=None, compressed_path=None, reconstructed_path=None, side_info_files=None, decompression_time_seconds=None, maximum_memory_kb=None)
Parameters:
  • codec_name – codec’s reported_name

  • codec_param_dict – dictionary of parameters to the codec

  • compressed_path – path to the output compressed file

  • reconstructed_path – path to the reconstructed file after decompression

  • side_info_files – list of file paths with side information

  • decompression_time_seconds – effective decompression time in seconds

  • maximum_memory_kb – maximum resident memory in kilobytes

class enb.icompression.GeneralLosslessExperiment(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: LosslessCompressionExperiment

Lossless compression experiment for general data contents.

class GenericFilePropertiesTable(csv_support_path=None, base_dir=None)

Bases: ImagePropertiesTable

File properties table that considers the input path as a 1D, u8be array.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_image_geometry>, 'label'='Big endian', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dynamic_range_bits': ColumnProperties('name'='dynamic_range_bits', 'fun'=<function ImagePropertiesTable.set_dynamic_range_bits>, 'label'='Dynamic range (bits)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'entropy_1B_bps': ColumnProperties('name'='entropy_1B_bps', 'fun'=<function ImagePropertiesTable.set_file_entropy>, 'label'='Entropy (bits, 1-byte samples)', 'plot_min'=0, 'plot_max'=8, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'entropy_2B_bps': ColumnProperties('name'='entropy_2B_bps', 'fun'=<function ImagePropertiesTable.set_file_entropy>, 'label'='Entropy (bits, 2-byte samples)', 'plot_min'=0, 'plot_max'=16, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_image_geometry>, 'label'='Float', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sample_max': ColumnProperties('name'='sample_max', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_sample_extrema>, 'label'='Max sample value (byte samples)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 
'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sample_min': ColumnProperties('name'='sample_min', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_sample_extrema>, 'label'='Min sample value (byte samples)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function GeneralLosslessExperiment.GenericFilePropertiesTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

set_bytes_per_sample(file_path, row)

Infer the number of bytes per sample from the file path.

set_image_geometry(file_path, row)

Obtain the image’s geometry (width, height and number of components) based on the file name tags (and possibly the file size).
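
A hypothetical sketch of this kind of tag parsing is shown below, assuming file names such as example-u16be-3x1024x1024.raw (type tag followed by components x height x width). The regular expressions are illustrative, not enb’s actual implementation.

    import re

    GEOMETRY_RE = re.compile(r"-(?P<components>\d+)x(?P<height>\d+)x(?P<width>\d+)\.raw$")
    TYPE_RE = re.compile(r"-[usf](?P<bits>8|16|32|64)(be|le)?-")

    def parse_raw_file_name(file_path: str) -> dict:
        """Return geometry and sample size inferred from the file name tags."""
        geometry = GEOMETRY_RE.search(file_path)
        sample_type = TYPE_RE.search(file_path)
        if geometry is None or sample_type is None:
            raise ValueError(f"Cannot infer geometry from {file_path!r}")
        return dict(component_count=int(geometry["components"]),
                    height=int(geometry["height"]),
                    width=int(geometry["width"]),
                    bytes_per_sample=int(sample_type["bits"]) // 8)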

set_sample_extrema(file_path, row)

Set the minimum and maximum byte value extrema.

verify_file_size = False
column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = ''

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

default_file_properties_table_class

alias of GenericFilePropertiesTable

class enb.icompression.GiciLibHelper

Bases: object

Definition of helper methods that can be used with software based on the GiciLibs (see gici.uab.cat/GiciWebPage/downloads.php).

file_info_to_data_str(original_file_info)
file_info_to_endianness_str(original_file_info)
get_gici_geometry_str(original_file_info)

Get a string to be passed to the -ig or -og parameters. The ‘-ig’ or ‘-og’ part is not included in the returned string.

class enb.icompression.JavaWrapperCodec(compressor_jar, decompressor_jar, param_dict=None)

Bases: WrapperCodec

Wrapper for *.jar codecs. The compression and decompression parameters are those that need to be passed to the ‘java’ command.

The compressor_jar and decompressor_jar attributes are added upon initialization based on the params to __init__.
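
A minimal usage sketch follows; the .jar paths and the parameter name are placeholders, not files or options shipped with enb.

    import enb.icompression

    # Hypothetical codec wrapping an existing pair of compression/decompression jars.
    codec = enb.icompression.JavaWrapperCodec(
        compressor_jar="./MyCompressor.jar",
        decompressor_jar="./MyDecompressor.jar",
        param_dict=dict(level=3))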

__init__(compressor_jar, decompressor_jar, param_dict=None)
Parameters:
  • compressor_jar – path to the *.jar file to be used for compression

  • decompressor_jar – path to the *.jar file to be used for decompression

  • param_dict – name-value mapping of the parameters to be used for compression

  • output_invocation_dir – if not None, invocation strings are stored in this directory with a name based on the codec and the image’s full path.

  • signature_in_name – if True, the default codec name includes part of the hexdigest of the compressor and decompressor binaries being used

class enb.icompression.LittleEndianWrapper(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)

Bases: WrapperCodec

Wrapper with identical semantics to WrapperCodec, but performs a big endian to little endian conversion for (big-endian) 2-byte and 4-byte samples. If the input is flagged as little endian, e.g., if -u16le- is in the original file name, then no transformation is performed.

Codecs inheriting from this class automatically receive little-endian samples, and are expected to reconstruct little-endian files (which are then translated back to big endian if and only if the original image was flagged as big endian).
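
The endianness swap itself amounts to the following (a simplified numpy sketch for unsigned 2- or 4-byte samples; not enb’s actual implementation):

    import numpy as np

    def big_to_little_endian(input_path: str, output_path: str, bytes_per_sample: int = 2) -> None:
        # Read big-endian samples and write them back as little-endian.
        big_dtype = {2: ">u2", 4: ">u4"}[bytes_per_sample]
        samples = np.fromfile(input_path, dtype=big_dtype)
        samples.astype(big_dtype.replace(">", "<")).tofile(output_path)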

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None.

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

class enb.icompression.LosslessCodec(param_dict=None)

Bases: AbstractCodec

An AbstractCodec that identifies itself as lossless.

class enb.icompression.LosslessCompressionExperiment(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: CompressionExperiment

Lossless compression of raw image files. The experiment fails if lossless compression is not attained.
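
A minimal usage sketch is given below; MyLosslessCodec and the dataset path are placeholders, and a real codec must implement compress() and decompress() so that reconstruction is bit-exact.

    import enb.icompression

    class MyLosslessCodec(enb.icompression.LosslessCodec):
        """Placeholder codec: a real implementation must fill in both methods."""
        def compress(self, original_path, compressed_path, original_file_info=None):
            ...  # write compressed_path from original_path

        def decompress(self, compressed_path, reconstructed_path, original_file_info=None):
            ...  # write reconstructed_path from compressed_path

    exp = enb.icompression.LosslessCompressionExperiment(
        codecs=[MyLosslessCodec()],
        dataset_paths=["./datasets/sample-u8be-1x512x512.raw"])
    df = exp.get_df()
    print(df[["task_name", "bpppc", "compression_time_seconds", "lossless_reconstruction"]])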

column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function LosslessCompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

set_comparison_results(index, row)

Perform a compression-decompression cycle and store the comparison results

class enb.icompression.LossyCodec(param_dict=None)

Bases: AbstractCodec

An AbstractCodec that identifies itself as lossy.

class enb.icompression.LossyCompressionExperiment(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: CompressionExperiment

Lossy compression of raw image files.

column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 
'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'mse': ColumnProperties('name'='mse', 'fun'=<function LossyCompressionExperiment.set_MSE>, 'label'='MSE', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'pae': ColumnProperties('name'='pae', 'fun'=<function LossyCompressionExperiment.set_PAE>, 'label'='PAE', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'psnr_bps': ColumnProperties('name'='psnr_bps', 'fun'=<function LossyCompressionExperiment.set_PSNR_nominal>, 'label'='PSNR (dB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'psnr_dr': ColumnProperties('name'='psnr_dr', 'fun'=<function LossyCompressionExperiment.set_PSNR_dynamic_range>, 'label'='PSNR (dB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 
'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

set_MSE(index, row)

Set the mean squared error of the reconstructed image.

set_PAE(index, row)

Set the peak absolute error (maximum absolute pixelwise difference) of the reconstructed image.

set_PSNR_dynamic_range(index, row)

Set the PSNR assuming dynamic range given by dynamic_range_bits.

set_PSNR_nominal(index, row)

Set the PSNR assuming nominal dynamic range given by bytes_per_sample.
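
As a hedged reference for the two PSNR definitions above (not necessarily enb’s exact code), the value in dB for a given bit depth can be computed as:

    import math

    def psnr_db(mse: float, range_bits: int) -> float:
        # psnr_bps would use range_bits = 8 * bytes_per_sample;
        # psnr_dr would use range_bits = dynamic_range_bits.
        if mse == 0:
            return math.inf
        max_value = 2 ** range_bits - 1
        return 10 * math.log10(max_value ** 2 / mse)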

class enb.icompression.NearLosslessCodec(param_dict=None)

Bases: LossyCodec

An AbstractCodec that identifies itself as near lossless.

class enb.icompression.QuantizationWrapperCodec(codec: AbstractCodec, qstep: int)

Bases: NearLosslessCodec

Perform uniform scalar quantization before compressing and after decompressing with a wrapped codec instance. Midpoint reconstruction is used in the dequantization stage.
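
The quantization and midpoint dequantization described above can be sketched as follows (illustrative only, assuming nonnegative integer samples):

    import numpy as np

    def quantize(samples: np.ndarray, qstep: int) -> np.ndarray:
        # Map each sample to the index of its quantization interval.
        return samples // qstep

    def dequantize(indices: np.ndarray, qstep: int) -> np.ndarray:
        # Midpoint reconstruction: return the center of each interval.
        return indices * qstep + qstep // 2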

__init__(codec: AbstractCodec, qstep: int)
Parameters:
  • codec – The codec instance used to compress and decompress the quantized data.

  • qstep – The quantization interval length

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None.

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

property label

Return the original codec label and the quantization parameter.

property name

Return the original codec name and the quantization parameter

class enb.icompression.SpectralAngleTable(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: LossyCompressionExperiment

Lossy compression experiment that computes spectral angle “distance” measures between the original and the reconstructed images.

Subclasses of LossyCompressionExperiment may inherit from this one to automatically add the data columns defined here

column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 
'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'max_spectral_angle_deg': ColumnProperties('name'='max_spectral_angle_deg', 'fun'=<function SpectralAngleTable.set_spectral_distances>, 'label'='Max spectral angle (deg)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'mean_spectral_angle_deg': ColumnProperties('name'='mean_spectral_angle_deg', 'fun'=<function SpectralAngleTable.set_spectral_distances>, 'label'='Mean spectral angle (deg)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'mse': ColumnProperties('name'='mse', 'fun'=<function LossyCompressionExperiment.set_MSE>, 'label'='MSE', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'pae': ColumnProperties('name'='pae', 'fun'=<function LossyCompressionExperiment.set_PAE>, 'label'='PAE', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'psnr_bps': ColumnProperties('name'='psnr_bps', 'fun'=<function LossyCompressionExperiment.set_PSNR_nominal>, 'label'='PSNR (dB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'psnr_dr': ColumnProperties('name'='psnr_dr', 'fun'=<function LossyCompressionExperiment.set_PSNR_dynamic_range>, 'label'='PSNR (dB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 
'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

get_spectral_angles_deg(index, row)

Return a sequence of spectral angles (in degrees), one per (x,y) position in the image, flattened in raster order.
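
A simplified sketch of this computation is shown below, assuming the images are numpy arrays shaped (components, height, width); it is not enb’s actual implementation.

    import numpy as np

    def spectral_angles_deg(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
        # One spectral vector per (x, y) position, flattened in raster order.
        o = original.reshape(original.shape[0], -1).astype(np.float64)
        r = reconstructed.reshape(reconstructed.shape[0], -1).astype(np.float64)
        denominator = np.linalg.norm(o, axis=0) * np.linalg.norm(r, axis=0)
        cos = (o * r).sum(axis=0) / np.maximum(denominator, np.finfo(np.float64).tiny)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))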

set_spectral_distances(index, row)
class enb.icompression.StructuralSimilarity(codecs, dataset_paths=None, csv_experiment_path=None, csv_dataset_path=None, dataset_info_table=None, overwrite_file_properties=False, reconstructed_dir_path=None, compressed_copy_dir_path=None, task_families=None)

Bases: CompressionExperiment

Compute the Structural Similarity (SSIM) and Multi-Scale Structural Similarity (MS-SSIM) metrics to measure the similarity between two images.

Authors:
column_to_properties = {'bpppc': ColumnProperties('name'='bpppc', 'fun'=<function CompressionExperiment.set_bpppc>, 'label'='Compressed data rate (bpppc)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_file_sha256': ColumnProperties('name'='compressed_file_sha256', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'="Compressed file's SHA256", 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compressed_size_bytes': ColumnProperties('name'='compressed_size_bytes', 'fun'=<function CompressionExperiment.set_compressed_data_size>, 'label'='Compressed data size (Bytes)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_1byte_entropy': ColumnProperties('name'='compression_efficiency_1byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (1B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_efficiency_2byte_entropy': ColumnProperties('name'='compression_efficiency_2byte_entropy', 'fun'=<function CompressionExperiment.set_efficiency>, 'label'='Compression efficiency (2B entropy)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_memory_kb': ColumnProperties('name'='compression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio': ColumnProperties('name'='compression_ratio', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_ratio_dr': ColumnProperties('name'='compression_ratio_dr', 'fun'=<function CompressionExperiment.set_compression_ratio_dr>, 'label'='Compression ratio', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'compression_time_seconds': ColumnProperties('name'='compression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Compression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'decompression_memory_kb': ColumnProperties('name'='decompression_memory_kb', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression memory usage (KB)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 
'decompression_time_seconds': ColumnProperties('name'='decompression_time_seconds', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Decompression time (s)', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'family_label': ColumnProperties('name'='family_label', 'fun'=<function Experiment.set_family_label>, 'label'='Family label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'lossless_reconstruction': ColumnProperties('name'='lossless_reconstruction', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Lossless?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'ms_ssim': ColumnProperties('name'='ms_ssim', 'fun'=<function StructuralSimilarity.set_StructuralSimilarity>, 'label'='MS-SSIM', 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'param_dict': ColumnProperties('name'='param_dict', 'fun'=<function Experiment.set_param_dict>, 'label'='Param dict', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'repetitions': ColumnProperties('name'='repetitions', 'fun'=<function CompressionExperiment.set_comparison_results>, 'label'='Number of compression/decompression repetitions', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'ssim': ColumnProperties('name'='ssim', 'fun'=<function StructuralSimilarity.set_StructuralSimilarity>, 'label'='SSIM', 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_apply_time': ColumnProperties('name'='task_apply_time', 'fun'=<function Experiment.set_task_apply_time>, 'label'='Task apply time', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_label': ColumnProperties('name'='task_label', 'fun'=<function Experiment.set_task_label>, 'label'='Task label', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'task_name': ColumnProperties('name'='task_name', 'fun'=<function Experiment.set_task_name>, 'label'='Task name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

compute_SSIM(img1, img2, max_val=255, filter_size=11, filter_sigma=1.5, k1=0.01, k2=0.03, full=False)

Return the Structural Similarity Map between img1 and img2.

This function attempts to match the functionality of ssim_index_new.m by Zhou Wang: http://www.cns.nyu.edu/~lcv/ssim/msssim.zip

Author’s Python implementation: https://github.com/dashayushman/TAC-GAN/blob/master/msssim.py

Parameters:
  • img1 – Numpy array holding the first RGB image batch.

  • img2 – Numpy array holding the second RGB image batch.

  • max_val – the dynamic range of the images (i.e., the difference between the maximum and minimum allowed values).

  • filter_size – Size of blur kernel to use (will be reduced for small images).

  • filter_sigma – Standard deviation for Gaussian blur kernel (will be reduced for small images).

  • k1 – Constant used to maintain stability in the SSIM calculation (0.01 in the original paper).

  • k2 – Constant used to maintain stability in the SSIM calculation (0.03 in the original paper).
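For illustration only, the following is a minimal sketch of the SSIM formula applied globally to whole images (no Gaussian windowing, unlike compute_SSIM), showing how max_val, k1 and k2 enter the computation; the helper name global_ssim is hypothetical and not part of enb:

    import numpy as np

    def global_ssim(img1, img2, max_val=255, k1=0.01, k2=0.03):
        # Whole-image SSIM: no sliding window, so this only illustrates the formula.
        x = img1.astype(np.float64)
        y = img2.astype(np.float64)
        c1 = (k1 * max_val) ** 2
        c2 = (k2 * max_val) ** 2
        mu_x, mu_y = x.mean(), y.mean()
        var_x, var_y = x.var(), y.var()
        cov_xy = ((x - mu_x) * (y - mu_y)).mean()
        numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
        denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
        return numerator / denominator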

cumpute_MSSIM(img1, img2, max_val=255, filter_size=11, filter_sigma=1.5, k1=0.01, k2=0.03, weights=None)

Return the MS-SSIM score between img1 and img2.

This function implements Multi-Scale Structural Similarity (MS-SSIM) Image Quality Assessment according to Zhou Wang’s paper, “Multi-scale structural similarity for image quality assessment” (2003). Link: https://ece.uwaterloo.ca/~z70wang/publications/msssim.pdf

Author’s MATLAB implementation: http://www.cns.nyu.edu/~lcv/ssim/msssim.zip

Author’s Python implementation: https://github.com/dashayushman/TAC-GAN/blob/master/msssim.py

Author’s documentation:

Parameters:
  • img1 – Numpy array holding the first RGB image batch.

  • img2 – Numpy array holding the second RGB image batch.

  • max_val – the dynamic range of the images (i.e., the difference between the maximum and minimum allowed values).

  • filter_size – Size of blur kernel to use (will be reduced for small images).

  • filter_sigma – Standard deviation for Gaussian blur kernel (will be reduced for small images).

  • k1 – Constant used to maintain stability in the SSIM calculation (0.01 in the original paper).

  • k2 – Constant used to maintain stability in the SSIM calculation (0.03 in the original paper).

set_StructuralSimilarity(index, row)
class enb.icompression.WrapperCodec(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)

Bases: AbstractCodec

A codec that uses an external process to compress and decompress.

__init__(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)
Parameters:
  • compressor_path – path to the executable to be used for compression

  • decompressor_path – path to the executable to be used for decompression

  • param_dict – name-value mapping of the parameters to be used for compression

  • output_invocation_dir – if not None, invocation strings are stored in this directory with a name based on the codec and the image’s full path.

  • signature_in_name – if True, the default codec name includes part of the hexdigest of the compressor and decompressor binaries being used.

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None.

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

static get_binary_signature(binary_path)

Return a string with a (hopefully) unique signature for the contents of binary_path. By default, the first 5 digits of the sha-256 hexdigest are returned.
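A minimal sketch of the default behaviour described above (the helper name binary_signature is hypothetical):

    import hashlib

    def binary_signature(binary_path, digits=5):
        # First characters of the SHA-256 hexdigest of the binary's contents
        with open(binary_path, "rb") as binary_file:
            return hashlib.sha256(binary_file.read()).hexdigest()[:digits]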

get_compression_params(original_path, compressed_path, original_file_info)

Return a string (shell style) with the parameters to be passed to the compressor.

Same parameter semantics as AbstractCodec.compress().

Parameters:

original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None

get_decompression_params(compressed_path, reconstructed_path, original_file_info)

Return a string (shell style) with the parameters to be passed to the decompressor. Same parameter semantics as AbstractCodec.decompress().

Parameters:

original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None

property name

Return the codec’s name and parameters, also including the encoder and decoder hash summaries (so that changes in the reference binaries can be easily detected)
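As an illustration, a minimal WrapperCodec subclass might look as follows. The binary name my_codec and its command-line flags are hypothetical; the sketch only assumes the documented behaviour that get_compression_params and get_decompression_params return the shell-style parameter strings described above:

    import enb.icompression

    class MyWrapperCodec(enb.icompression.WrapperCodec):
        # Hypothetical codec wrapping an external "my_codec" binary.
        def __init__(self, level=1):
            super().__init__(compressor_path="my_codec",
                             decompressor_path="my_codec",
                             param_dict=dict(level=level))

        def get_compression_params(self, original_path, compressed_path, original_file_info):
            # Shell-style string appended to the compressor invocation
            return f"-c -level {self.param_dict['level']} -i {original_path} -o {compressed_path}"

        def get_decompression_params(self, compressed_path, reconstructed_path, original_file_info):
            return f"-d -i {compressed_path} -o {reconstructed_path}"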

enb.isets module

Image sets information tables

class enb.isets.BILToBSQ(version_base_dir, original_base_dir=None, csv_support_path=None, original_properties_table=None)

Bases: BIPToBSQ

Convert raw images (no header) from band-interleaved line order (BIL) to band-sequential order (BSQ).

array_order = 'bil'
column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'bil'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

class enb.isets.BIPToBSQ(version_base_dir, original_base_dir=None, csv_support_path=None, original_properties_table=None)

Bases: ImageVersionTable

Convert raw images (no header) from band-interleaved pixel order (BIP) to band-sequential order (BSQ).

__init__(version_base_dir, original_base_dir=None, csv_support_path=None, original_properties_table=None)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • version_name – arbitrary name of this file version

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, enb.config.options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, a FilePropertiesTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

  • check_generated_files – if True, the table checks that each call to version() produces a file at output_path. Set to False to create arbitrarily named output files.

array_order = 'bip'
column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'bip'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

version(input_path, output_path, row)

Create a version of input_path and write it into output_path.

Parameters:
  • input_path – path to the file to be versioned

  • output_path – path where the version should be saved

  • row – metainformation available using super().get_df for input_path

Returns:

if not None, the time in seconds it took to perform the (forward) versioning.

class enb.isets.BandEntropyTable(csv_support_path=None, base_dir=None)

Bases: ImageGeometryTable

Table to calculate the entropy of each band

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'entropy_per_band': ColumnProperties('name'='entropy_per_band', 'fun'=<function BandEntropyTable.set_entropy_per_band>, 'label'='Entropy per band', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

set_entropy_per_band(file_path, row)

Store a dictionary indexed by band index (zero-indexed) with values being entropy in bits per sample.

class enb.isets.DivisibleSizeVersion(version_base_dir, dimension_size_multiple, original_base_dir=None, csv_support_path=None, original_properties_table=None)

Bases: ImageVersionTable

Crop the spatial dimensions of all (raw) images in a directory so that they are all multiple of a given number. Useful for quickly curating datasets that can be divided into blocks of a given size.
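A minimal sketch of the cropping rule described above, assuming dimension_size_multiple = 64 (values are illustrative):

    # Each spatial dimension is cropped down to the nearest multiple of dimension_size_multiple.
    width, height = 500, 300
    dimension_size_multiple = 64
    cropped_width = width - (width % dimension_size_multiple)    # 448
    cropped_height = height - (height % dimension_size_multiple) # 256
    # Images smaller than dimension_size_multiple in x or y raise a ValueError instead.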

__init__(version_base_dir, dimension_size_multiple, original_base_dir=None, csv_support_path=None, original_properties_table=None)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • dimension_size_multiple – the x and y dimensions of each image are cropped so that they become a multiple of this value, which must be strictly positive. If the image is smaller than this value in either the x dimension, the y dimension, or both, a ValueError is raised.

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, enb.config.options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, an enb.isets.ImageGeometryTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

version(input_path, output_path, row)

Create a version of input_path and write it into output_path.

Parameters:
  • input_path – path to the file to be versioned

  • output_path – path where the version should be saved

  • row – metainformation available using super().get_df for input_path

Returns:

if not None, the time in seconds it took to perform the (forward) versioning.

class enb.isets.HistogramFullnessTable1Byte(index='index', csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: ATable

Compute a histogram of usage assuming 1-byte samples.

column_to_properties = {'histogram_fullness_1byte': ColumnProperties('name'='histogram_fullness_1byte', 'fun'=<function HistogramFullnessTable1Byte.set_histogram_fullness_1byte>, 'label'='Histogram usage fraction (1 byte)', 'plot_min'=0, 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

set_histogram_fullness_1byte(file_path, row)

Set the fraction of the histogram (of all possible values that can be represented) that is actually present in file_path, considering unsigned 1-byte samples.
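A minimal sketch of this fraction for 1-byte samples (the helper name is hypothetical):

    import numpy as np

    def histogram_fullness_1byte(file_path):
        # Fraction of the 256 possible unsigned 8-bit values that actually occur in the file
        data = np.fromfile(file_path, dtype=np.uint8)
        return np.unique(data).size / 256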

class enb.isets.HistogramFullnessTable2Bytes(index='index', csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: ATable

Compute a histogram of usage assuming 2-byte samples.

column_to_properties = {'histogram_fullness_2bytes': ColumnProperties('name'='histogram_fullness_2bytes', 'fun'=<function HistogramFullnessTable2Bytes.set_histogram_fullness_2bytes>, 'label'='Histogram usage fraction (2 bytes)', 'plot_min'=0, 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

set_histogram_fullness_2bytes(file_path, row)

Set the fraction of the histogram (of all possible values that can be represented) that is actually present in file_path, considering unsigned 2-byte samples.

class enb.isets.HistogramFullnessTable4Bytes(index='index', csv_support_path=None, column_to_properties=None, progress_report_period=None)

Bases: ATable

Compute a histogram of usage assuming 4-byte samples.

column_to_properties = {'histogram_fullness_4bytes': ColumnProperties('name'='histogram_fullness_4bytes', 'fun'=<function HistogramFullnessTable4Bytes.set_histogram_fullness_4bytes>, 'label'='Histogram usage fraction (4 bytes)', 'plot_min'=0, 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

set_histogram_fullness_4bytes(file_path, row)

Set the fraction of the histogram (of all possible values that can be represented) that is actually present in file_path, considering 4-byte samples.

class enb.isets.ImageGeometryTable(csv_support_path=None, base_dir=None)

Bases: FilePropertiesTable

Basic properties table for images, including geometry. Allows automatic handling of tags in filenames, e.g., u16be-ZxYxX.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 
'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

set_big_endian(file_path, row)

Infer whether the data are big endian from the file path.

set_bytes_per_sample(file_path, row)

Infer the number of bytes per sample from the file path.

set_column_dtype(file_path, row)

Infer numpy’s data type from the file path.

set_float(file_path, row)

Infer whether the data are floating point from the file path.

set_image_geometry(file_path, row)

Obtain the image’s geometry (width, height and number of components) based on the filename tags (and possibly its size)

set_samples(file_path, row)

Set the number of samples in the image

set_signed(file_path, row)

Infer whether the data are signed from the file path.

set_type_name(file_path, row)

Set the type name usable in file names

verify_file_size = True
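For reference, a minimal numpy sketch of reading a raw (headerless) image that follows the geometry-tag naming convention described above; the file name is hypothetical, and band-sequential (BSQ) sample order is assumed here:

    import numpy as np

    # "u16be" -> unsigned, 2 bytes per sample, big endian; "3x1024x1024" -> Z x Y x X geometry
    path = "example-u16be-3x1024x1024.raw"
    components, height, width = 3, 1024, 1024
    img = np.fromfile(path, dtype=np.dtype(">u2")).reshape(components, height, width)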
class enb.isets.ImagePropertiesTable(csv_support_path=None, base_dir=None)

Bases: ImageGeometryTable

Properties table for images, with geometry and additional statistical information. Allows automatic handling of tags in filenames, e.g., ZxYxX_u16be.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dynamic_range_bits': ColumnProperties('name'='dynamic_range_bits', 'fun'=<function ImagePropertiesTable.set_dynamic_range_bits>, 'label'='Dynamic range (bits)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'entropy_1B_bps': ColumnProperties('name'='entropy_1B_bps', 'fun'=<function ImagePropertiesTable.set_file_entropy>, 'label'='Entropy (bits, 1-byte samples)', 'plot_min'=0, 'plot_max'=8, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'entropy_2B_bps': ColumnProperties('name'='entropy_2B_bps', 'fun'=<function ImagePropertiesTable.set_file_entropy>, 'label'='Entropy (bits, 2-byte samples)', 'plot_min'=0, 'plot_max'=16, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sample_max': ColumnProperties('name'='sample_max', 'fun'=<function ImagePropertiesTable.set_sample_extrema>, 'label'='Max sample value', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sample_min': ColumnProperties('name'='sample_min', 'fun'=<function ImagePropertiesTable.set_sample_extrema>, 'label'='Min sample value', 
'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

set_dynamic_range_bits(file_path, row)

Set the minimum number of bits per sample that can be used to store the data (without compression). Until v0.4.4, this value was obtained based on the number of bits needed to represent max-min (where min and max are the minimum and maximum sample values). From version v0.4.5 onwards, the dynamic range B is the minimum integer so that all data samples lie in [0, 2^B-1] for unsigned data and in [-2^(B-1), 2^(B-1)-1] for signed data. The calculation for floating point data is not changed, and is always 8*bytes_per_sample.
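A minimal sketch of the v0.4.5 rule for integer data (the helper name is hypothetical):

    def dynamic_range_bits(sample_min, sample_max, signed):
        # Smallest B such that all samples fit in [0, 2**B - 1] (unsigned)
        # or in [-2**(B-1), 2**(B-1) - 1] (signed)
        bits = 1
        if signed:
            while sample_min < -(2 ** (bits - 1)) or sample_max > 2 ** (bits - 1) - 1:
                bits += 1
        else:
            while sample_max > 2 ** bits - 1:
                bits += 1
        return bits

    assert dynamic_range_bits(0, 1023, signed=False) == 10
    assert dynamic_range_bits(-100, 200, signed=True) == 9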

set_file_entropy(file_path, row)

Set the zero-order entropy of the data in file_path for 1, 2 and 4 bytes per sample in entropy_1B_bps, entropy_2B_bps and entropy_4B_bps, respectively. If the file size is not a multiple of those bytes per sample, -1 is stored instead.
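A minimal sketch of the zero-order entropy rule above for a single bytes-per-sample interpretation (the helper name is hypothetical):

    import os
    import numpy as np

    def file_entropy_bps(file_path, bytes_per_sample):
        # -1 when the file size is not a multiple of the chosen sample size
        if os.path.getsize(file_path) % bytes_per_sample != 0:
            return -1
        dtype = {1: np.uint8, 2: np.uint16, 4: np.uint32}[bytes_per_sample]
        data = np.fromfile(file_path, dtype=dtype)
        _, counts = np.unique(data, return_counts=True)
        probabilities = counts / counts.sum()
        return float(-(probabilities * np.log2(probabilities)).sum())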

set_sample_extrema(file_path, row)

Set the minimum and maximum values stored in file_path.

class enb.isets.ImageVersionTable(version_base_dir, version_name, original_base_dir=None, csv_support_path=None, check_generated_files=True, original_properties_table=None)

Bases: FileVersionTable, ImageGeometryTable

Transform all images and save the transformed versions.

__init__(version_base_dir, version_name, original_base_dir=None, csv_support_path=None, check_generated_files=True, original_properties_table=None)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • version_name – arbitrary name of this file version

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, enb.config.options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, a FilePropertiesTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

  • check_generated_files – if True, the table checks that each call to version() produces a file at output_path. Set to False to create arbitrarily named output files.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

class enb.isets.QuantizedImageVersion(version_base_dir, qstep, original_base_dir=None, csv_support_path=None, check_generated_files=True, original_properties_table=None)

Bases: ImageVersionTable

Apply uniform quantization and store the results.

__init__(version_base_dir, qstep, original_base_dir=None, csv_support_path=None, check_generated_files=True, original_properties_table=None)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • qstep – quantization step of the uniform quantizer.

  • version_name – arbitrary name of this file version

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, a FilePropertiesTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

  • check_generated_files – if True, the table checks that each call to version() produces a file at output_path. Set to False to create arbitrarily named output files.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 
'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

version(input_path, output_path, row)

Apply uniform quantization and store the results.

class enb.isets.SampleDistributionTable(csv_support_path=None, base_dir=None)

Bases: ImageGeometryTable

Compute the data probability distributions.

column_to_properties = {'big_endian': ColumnProperties('name'='big_endian', 'fun'=<function ImageGeometryTable.set_big_endian>, 'label'='Big endian?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'bytes_per_sample': ColumnProperties('name'='bytes_per_sample', 'fun'=<function ImageGeometryTable.set_bytes_per_sample>, 'label'='Bytes per sample', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'component_count': ColumnProperties('name'='component_count', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Components', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'dtype': ColumnProperties('name'='dtype', 'fun'=<function ImageGeometryTable.set_column_dtype>, 'label'='Numpy dtype', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'float': ColumnProperties('name'='float', 'fun'=<function ImageGeometryTable.set_float>, 'label'='Floating point data?', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'height': ColumnProperties('name'='height', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Height', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sample_distribution': ColumnProperties('name'='sample_distribution', 'fun'=<function SampleDistributionTable.set_sample_distribution>, 'label'='Sample probability distribution', 'plot_min'=0, 'plot_max'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=True, 'has_iterable_values'=False, 'has_object_values'=False), 'samples': ColumnProperties('name'='samples', 'fun'=<function ImageGeometryTable.set_samples>, 'label'='Sample count', 'plot_min'=0, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'signed': ColumnProperties('name'='signed', 'fun'=<function ImageGeometryTable.set_signed>, 'label'='Signed samples', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 
'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'type_name': ColumnProperties('name'='type_name', 'fun'=<function ImageGeometryTable.set_type_name>, 'label'='Type name usable in file names', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'width': ColumnProperties('name'='width', 'fun'=<function ImageGeometryTable.set_image_geometry>, 'label'='Width', 'plot_min'=1, 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

set_sample_distribution(file_path, row)

Compute the data probability distribution of the data in file_path.

enb.isets.dump_array(array, file_or_path, mode='wb', dtype=None, order='bsq')

Dump a raw array indexed in [x,y,z] order into BSQ, BIL or BIP order. BSQ is the concatenation of each component (z axis), each component in raster order. BIL contains the first row of each band, in order, then the second row of each band, and so forth. BIP contains all components of a pixel, in order, then the next pixel (in raster order), etc. Parent folders are created if they do not already exist.

Parameters:
  • file_or_path – It can be either a file-like object, or a string-like object. If it is a file, contents are written without altering the file pointer beforehand. In this case, the file is not closed afterwards. If it is a string-like object, it will be interpreted as a file path, opened as determined by the mode parameter.

  • mode – if file_or_path is a path, the output file is opened in this mode

  • dtype – if not None, the array is cast to this type before dumping

  • force_big_endian – if True, a copy of the array is made and its bytes are swapped before outputting data to file. This parameter is ignored if dtype is provided.

  • order – “bsq” for band sequential order, or “bil” for band interleaved.

enb.isets.dump_array_bil(array, file_or_path, mode='wb', dtype=None)

Dump an image array into raw format using band interleaved line (BIL) sample ordering. See enb.isets.dump_array() for more details.

enb.isets.dump_array_bip(array, file_or_path, mode='wb', dtype=None)

Dump an image array into raw format using band interleaved pixel (BIP) sample ordering. See enb.isets.dump_array() for more details.

enb.isets.dump_array_bsq(array, file_or_path, mode='wb', dtype=None)

Dump an image array into raw format using band sequential (BSQ) sample ordering. See enb.isets.dump_array() for more details.

enb.isets.entropy(data)

Compute the zero-order entropy of the provided data

enb.isets.file_path_to_geometry_dict(file_path, existing_dict=None, verify_file_size=True)

Return a dict with basic geometry information based on the file path and the file size. The basename of the file should contain something like u8be-3x1000x2000, where u8be is the data format (unsigned, 8 bits per sample, big endian) and the dimensions are ZxYxX (Z=3, Y=1000 and X=2000 in this example).

Parameters:
  • file_path – file path whose basename is used to determine the image geometry.

  • existing_dict – if not None, this dict is updated and then returned. If None, a new dictionary is created.

  • verify_file_size – if True, file_path is expected to be exactly Z*X*Y*bytes_per_sample bytes, and an exception is raised if it is not.

enb.isets.iproperties_row_to_geometry_tag(image_properties_row)

Return an image geometry name tag recognized by isets (e.g., 3x600x800 for an 800x600, 3 component image), given an object similar to an ImageGeometryTable row.

enb.isets.iproperties_row_to_numpy_dtype(image_properties_row)

Return a string that identifies the most simple numpy dtype needed to represent an image with properties as defined in image_properties_row

enb.isets.iproperties_row_to_sample_type_tag(image_properties_row)

Return a sample type name tag as recognized by isets (e.g., u16be), given an object similar to an ImageGeometryTable row.

enb.isets.iproperties_to_name_tag(width, height, component_count, big_endian, bytes_per_sample, signed)

Return a full name tag (including sample type and dimension information), recognized by isets.

enb.isets.kl_divergence(data1, data2)

Return KL(P||Q) and KL(Q||P), where KL is the KL divergence in bits per sample, P is the sample probability distribution of data1, and Q is the sample probability distribution of data2.

If both P and Q contain the same values (even if with different counts), both returned values are identical and as defined in https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Definition.

Otherwise, the formula is modified so that whenever p or q is 0, that term is skipped from the sum. In this case, the two returned values most likely differ and should be interpreted with care.
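
The following minimal sketch (not enb’s implementation) illustrates the modified sum described above for KL(P||Q), assuming two flat numpy arrays of samples; terms where either probability is zero are skipped.

    import numpy as np

    def kl_divergence_bits_sketch(data1, data2):
        """Sketch of KL(P||Q) in bits, skipping terms where p or q is 0."""
        values = np.union1d(np.unique(data1), np.unique(data2))
        p = np.array([np.mean(data1 == v) for v in values], dtype=float)
        q = np.array([np.mean(data2 == v) for v in values], dtype=float)
        mask = (p > 0) & (q > 0)  # skip zero-probability terms
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))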

enb.isets.load_array(file_or_path, image_properties_row=None, width=None, height=None, component_count=None, dtype=None, order='bsq')

Load a numpy array indexed by [x,y,z] from file_or_path using the geometry information in image_properties_row.

Data in the file can be presented in BSQ or BIL order.

Parameters:
  • file_or_path – either a string with the path to the input file, or a file open for reading (typically with “b” mode).

  • image_properties_row

    if not None, it shall be a dict-like object. The width, height, component_count, bytes_per_sample, signed, big_endian and float keys should be present to determine the read parameters. If dtype is provided, then bytes_per_sample, big_endian and float are not used. The remaining arguments overwrite those defined in image_properties_row (if image_properties_row is not None and if present).

    If image_properties_row is None and any of (width, height, component_count, dtype) is None, the image geometry is required to be in the filename as a name tag. These tags, e.g., u8be-3x600x800 inform enb of all the required geometry. The format of these tags (which can appear anywhere in the filename) is:

    • u or s for unsigned and signed, respectively

    • the number of bits per sample (typically, 8, 16, 32 or 64)

    • be or le for big-endian and little-endian formats, respectively

    • ZxYxX, where Z is the number of spectral components (3 in the example), X the width (number of columns, 800 in the example) and Y the height (number of rows, 600 in the example).

    If image_properties_row is None and no geometry name tag is present in the file name, then the following parameters must not be None:

  • width – if not None, force the read to assume this image width

  • height – if not None, force the read to assume this image height

  • component_count – if not None, force the read to assume this number of components (bands)

  • dtype – if not None, it must be a valid argument for dtype in numpy, and will be used for reading. In this case, the bytes_per_sample, signed, big_endian and float keys are not accessed in image_properties_row.

  • order – “bsq” for band sequential order, or “bil” for band interleaved.

Returns:

a 3-D numpy array with the image data, which can be indexed as [x,y,z].

enb.isets.load_array_bil(file_or_path, image_properties_row=None, width=None, height=None, component_count=None, dtype=None)

Load an array in BIL order. See enb.isets.load_array.

enb.isets.load_array_bip(file_or_path, image_properties_row=None, width=None, height=None, component_count=None, dtype=None)

Load an array in BIP order. See enb.isets.load_array.

enb.isets.load_array_bsq(file_or_path, image_properties_row=None, width=None, height=None, component_count=None, dtype=None)

Load an array in BSQ order. See enb.isets.load_array.
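
As a usage sketch (file name and data are arbitrary examples, not part of the API reference), an array can be dumped in BSQ order under a file name carrying the geometry tag described above, and loaded back without providing an image_properties_row:

    import numpy as np
    import enb.isets

    # Example image: Z=2 components, Y=3 rows, X=4 columns, unsigned 8-bit
    # samples. enb.isets arrays are indexed as [x, y, z], i.e., shape (X, Y, Z).
    array = np.arange(2 * 3 * 4, dtype=np.uint8).reshape((4, 3, 2))

    # The u8be-2x3x4 tag (sample type, then ZxYxX) encodes the geometry.
    path = "u8be-2x3x4.raw"
    enb.isets.dump_array_bsq(array, path)

    loaded = enb.isets.load_array_bsq(path)
    assert np.array_equal(array, loaded)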

enb.isets.mutual_information(data1, data2)

Compute the mutual information between two vectors of identical length after flattening. Implemented following https://en.wikipedia.org/wiki/Mutual_information#Definition

enb.jpg module

JPEG manipulation (e.g., curation) tools.

class enb.jpg.JPEGCurationTable(original_base_dir, version_base_dir, csv_support_path=None)

Bases: PNGCurationTable

Given a directory tree containing JPEG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'jpg'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

enb.log module

Logging utilities for enb.

It uses only symbols from .misc, but no other module in enb.

class enb.log.LogLevel(name, priority=0, prefix=None, help_message=None, style=None)

Bases: object

Each of the available logging levels is an instance of this class. A level represents a named type of message, with a priority comparable to other levels.

__init__(name, priority=0, prefix=None, help_message=None, style=None)
Parameters:
  • priority – minimum priority level needed to show this level.

  • name – unique name for the level.

  • prefix – prefix when printing messages of this level. If None, a default one is used based on the name.

  • help_message – optional help explaining the purpose of the level.

  • style – if not None, a color with which messages of this level are displayed. See https://rich.readthedocs.io/en/stable/appendix/colors.html for more details about available colors. If None, the default color is used.

class enb.log.Logger(*args, **kwargs)

Bases: object

Message logging and printing hub for enb.

Messages are only shown if their priority is at least as high as the configured minimum.

The minimum required level name (from “core” down to “debug”) can be selected via the CLI and the file-based configuration by setting the selected_log_level flag/option.

You can then modify this minimum value programmatically by setting enb.config.options.minimum_priority_level to a new LogLevel instance, such as LOG_VERBOSE or any of the other constants defined above.

__init__()
banner_enb_name_style = '#f3ac05 bold'
banner_enb_version_style = '#9b5ccb bold'
banner_line_style = '#767676 bold'
banner_plain_text_style = '#767676'
core(msg, **kwargs)

A message of “core” level.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property core_active

Return True if and only if the core level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the core level.

core_context(msg, sep='...', msg_after=None, show_duration=True)

Logging context of core priority.

Parameters:
  • msg – Message to show before starting the code block.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting).

  • msg_after – message shown after msg and sep upon completion.

  • show_duration – if True, a message displaying the run time is logged upon completion.

debug(msg, **kwargs)

Log a debug trace.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property debug_active

Return True if and only if the debug level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the debug level.

debug_context(msg, sep='...', msg_after=None, show_duration=True)

Logging context of debug priority.

Parameters:
  • msg – Message to show before starting the code block.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting)

  • msg_after – message shown after msg and sep upon completion.

  • show_duration – if True, a message displaying the run time is logged upon completion.

error(msg, **kwargs)

Log an error message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property error_active

Return True if and only if the error level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the error level.

get_level(name, lower_priority=0)

If lower_priority is 0, return the logging level associated with the name passed as argument. Otherwise, that level’s priority is lowered by the given numeric amount (positive values mean that lower-priority levels can be selected).

After that, the available level with the closest priority is chosen.

info(msg, **kwargs)

Log an extra-informative console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property info_active

Return True if and only if the info level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the info level.

info_context(msg, sep='...', msg_after=None, show_duration=True)

Logging context of info priority.

Parameters:
  • msg – Message to show before starting the code block.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting)

  • msg_after – message shown after msg and sep upon completion.

  • show_duration – if True, a message displaying the run time is logged upon completion.

property is_parallel_process

Lazy property to determine whether this is currently a parallel process.

property is_ray_enabled

Lazy property to determine whether ray is available and enabled.

level_active(name, **kwargs)

Return True if and only if the given name corresponds to a level with priority sufficient given self.min_priority_level.

levels_by_priority()

Return a list of the available levels, sorted from higher to lower priority.

log(msg, level, end='\n', file=None, markup=False, highlight=False, style=None, rule=False, rule_kwargs=None)

Conditionally log a message given its level. It shares only the “end” keyword argument with builtins.print.

Parameters:
  • msg – message to be logged

  • level – priority level for the message

  • end – string appended after the message, if it is shown.

  • file – file where to log the message, or None to automatically select sys.stdout

  • markup – should rich markup be interpreted within the message?

  • highlight – should rich apply automatic highlighting of numbers, constants, etc., to the message?

  • style – if not None, the level’s current style is overwritten by this

  • rule – should the message be displayed with console.rule()?

  • rule_kwargs – if rule is True, these parameters are passed to console.rule

log_context(msg, level, sep='...', msg_after=None, show_duration=True)

Log a message before executing the with block code, run the block, and log another message when the block is completed. The message is given the selected priority level, and is only displayed based on self.selected_log_level. The block of code is executed regardless of the logging options.

Parameters:
  • msg – Message typically describing the block of code being executed.

  • level – Priority level for the shown messages.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting)

  • msg_after – message shown after msg and sep upon completion. If None, one is automatically selected based on msg.

  • show_duration – if True, a message displaying the run time is logged upon completion.

message(msg, **kwargs)

Log a regular console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property message_active

Return True if and only if the message level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the message level.

message_context(msg, sep='...', msg_after=None, show_duration=True)

Logging context of message priority.

Parameters:
  • msg – Message to show before starting the code block.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting)

  • msg_after – message shown after msg and sep upon completion.

  • show_duration – if True, a message displaying the run time is logged upon completion.

print_to_log(*args, sep=' ', end='\n', file=None)

Method used to substitute print if configured to do so. If file is None, then sys.stdout is used by default.

report_level_status()
Returns:

a string reporting the present logging levels and whether or not they are active.

show_banner(level=None)

Shows the enb banner, including the current version.

Parameters:

level – the priority level with which the banner is shown. If None, verbose is used by default.

style_core = '#28c9ff'
style_debug = '#909090'
style_error = 'bold #ff5255'
style_info = '#9b5ccb'
style_message = '#28c9ff'
style_verbose = '#a5d3a5'
style_warn = '#ffca4f'
verbose(msg, **kwargs)

Log a verbose console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property verbose_active

Return True if and only if the verbose level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the verbose level.

verbose_context(msg, sep='...', msg_after=None, show_duration=True)

Logging context of verbose priority.

Parameters:
  • msg – Message to show before starting the code block.

  • sep – separator printed between msg and msg_after (a newline is not required in it, to allow single-line reporting)

  • msg_after – message shown after msg and sep upon completion.

  • show_duration – if True, a message displaying the run time is logged upon completion.

warn(msg, **kwargs)

Log a warning message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

property warn_active

Return True if and only if the warn level is currently active, i.e., the current self.min_priority_level has a greater or equal priority value than the warn level.

enb.log.core(msg, **kwargs)

A message of “core” level.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.debug(msg, **kwargs)

Log a debug trace.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.error(msg, **kwargs)

Log an error message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.get_level(name, lower_priority=0)

If lower_priority is 0, return the logging level associated with the name passed as argument. Otherwise, that level’s priority is lowered by the given numeric amount (positive values mean that lower-priority levels can be selected).

After that, the available level with the closest priority is chosen.

enb.log.info(msg, **kwargs)

Log an extra-informative console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.log(msg, level, end='\n', file=None, markup=False, highlight=False, style=None, rule=False, rule_kwargs=None)

Conditionally log a message given its level. It shares only the “end” keyword argument with builtins.print.

Parameters:
  • msg – message to be logged

  • level – priority level for the message

  • end – string appended after the message, if it is shown.

  • file – file where to log the message, or None to automatically select sys.stdout

  • markup – should rich markup be interpreted within the message?

  • highlight – should rich apply automatic highlighting of numbers, constants, etc., to the message?

  • style – if not None, the level’s current style is overwritten by this

  • rule – should the message be displayed with console.rule()?

  • rule_kwargs – if rule is True, these parameters are passed to console.rule

enb.log.message(msg, **kwargs)

Log a regular console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.report_level_status()
Returns:

a string reporting the present logging levels and whether or not they are active.

enb.log.show_banner(level=None)

Shows the enb banner, including the current version.

Parameters:

level – the priority level with which the banner is shown. If None, verbose is used by default.

enb.log.verbose(msg, **kwargs)

Log a verbose console message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)

enb.log.warn(msg, **kwargs)

Log a warning message.

Parameters:

kwargs – optional arguments passed to self.log (must be compatible)
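
As a brief usage sketch of the module-level helpers documented above (whether each message is actually printed depends on the configured selected_log_level):

    import enb.log

    enb.log.message("Regular console message")
    enb.log.info("Shown only when the info level is active")
    enb.log.debug("Shown only when the debug level is active")

    # Show which logging levels are currently active.
    print(enb.log.report_level_status())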

enb.misc module

Miscellaneous tools for enb.

This module does not and should not import anything from enb, so that other modules may use misc tools at definition time.

class enb.misc.BootstrapLogger

Bases: object

Imitate enb.log.Logger’s interface before it is loaded. This is needed to solve circular imports, i.e., when an error with the managed attributes decorator takes place before the full logger is available (within the config submodule).

__init__()
core(*args, **kwargs)
debug(*args, **kwargs)
error(*args, **kwargs)
info(*args, **kwargs)
log(*args, style=None, **kwargs)
message(*args, **kwargs)
verbose(*args, **kwargs)
warn(*args, **kwargs)
class enb.misc.CircularList(iterable=(), /)

Bases: list

A tuned list that automatically applies modulo len(self) to the given index, allowing for circular, index-based access to the data (whereas itertools.cycle does not allow accessing elements by index).
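
A minimal illustration of the modulo-based indexing described above (the list contents are arbitrary):

    from enb.misc import CircularList

    colors = CircularList(["red", "green", "blue"])
    # Index 4 wraps around: 4 % 3 == 1.
    assert colors[4] == "green"
    assert colors[-1] == "blue"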

class enb.misc.ExposedProperty(instance, property_name)

Bases: object

This class can be used to expose object properties as public callables that return what requesting that property would.

__init__(instance, property_name)
class enb.misc.LapTimer

Bases: object

Keep track of time duration similar to a lap timer. Useful to track the time elapsed between consecutive calls to print_lap.

__init__()
print_lap(msg=None)

Print the elapsed time since the last time this method was called, or when this instance was created if it is the first time this method is called.

class enb.misc.Singleton

Bases: type

Classes using this as metaclass will only be instantiated once.
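
A minimal sketch of how this metaclass is typically used (the example class name is arbitrary):

    from enb.misc import Singleton

    class ExampleRegistry(metaclass=Singleton):
        """Every instantiation returns the same object."""
        def __init__(self):
            self.entries = []

    assert ExampleRegistry() is ExampleRegistry()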

enb.misc.capture_usr1()

Capture the reception of a USR1 signal into pdb.

From http://blog.devork.be/2009/07/how-to-bring-running-python-program.html.

enb.misc.class_to_fqn(cls)

Given a class (type instance), return its fully qualified name (FQN).

enb.misc.csv_to_latex_tabular(input_csv_path, output_tex_path, contains_header=True, use_booktabks=True)

Read a CSV table from a file and output it as a latex table to another file. The first row is assumed to be the header.

Parameters:
  • input_csv_path – path to a file containing CSV data.

  • output_tex_path – path where the tex contents are to be stored, ready to be added to latex with the input command.

  • contains_header – if True, the first line is assumed to be a header containing column names.

  • use_booktabs – if True, a booktabs-based decoration style is used for the table. Otherwise, only standard latex is used.
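
For instance (paths are placeholders), a CSV file with a header row can be converted into a tabular ready to be added with latex’s input command:

    import enb.misc

    enb.misc.csv_to_latex_tabular(
        input_csv_path="results.csv",
        output_tex_path="results_table.tex",
        contains_header=True)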

enb.misc.escape_latex(s)

Return a latex-escaped version of string s.

enb.misc.get_all_subclasses(*base_classes)

Return a set of all subclasses of the classes in base_classes, which have been defined at this point.

The base classes are never returned as subclasses.

Parameters:

base_classes – the list of classes for which subclasses are to be found

enb.misc.get_defining_class_name(method)

Return the name of the class that defines method, or None if method is not bound to any class.

enb.misc.get_node_ip()

Get the current IP address of this node.

enb.misc.get_node_name()

Get the host name of this node. Alias for socket.gethostname.

enb.misc.remove_argparse_action(parser, action)

Entirely remove an action from a parser, from its subparsers and groups if it exists. Adapted from https://stackoverflow.com/a/49753634.

enb.misc.split_camel_case(camel_string)

Split a camel case string like ThisIsAClass into a string like “This Is A Class”.

enb.parallel module

Abstraction layer to provide parallel processing both locally and on ray clusters.

class enb.parallel.FallbackFuture(fun, args, kwargs)

Bases: object

The fallback future is invoked when get is called.

__init__(fun, args, kwargs)
current_id = 0
get(**kwargs)

Blocking get of the return of the parallelized function.

pathos_pool = None
ready()

Return True if the result has been received from the parallelized function.

class enb.parallel.ProgressiveGetter(id_list, weight_list=None, iteration_period=1, alive_bar=None)

Bases: object

When an instance is created, the computation of the requested list of calls is started in parallel in the background (unless they are already running).

The returned instance is an iterable object. Each call to next() with this instance will either return the instance if any tasks are still running, or raise StopIteration if all are complete. Therefore, instances of this class can be used as the right operand of in in for loops.

A main application of this for-loop approach is to periodically run a code snippet (e.g., for logging) while the computation is performed in the background. The loop will continue until all tasks are completed. One can then call ray.get(ray_id_list) and retrieve the obtained results without any expected delay.

Note that the for-loop body will always be executed at least once, namely after every potentially blocking call to ray.wait().

__init__(id_list, weight_list=None, iteration_period=1, alive_bar=None)

Start the background computation of ids returned by start calls of methods decorated with enb.parallel.parallel. After this call, the object is ready to receive next() requests.

Parameters:
  • id_list – list ids whose values are to be returned.

  • weight_list – if not None, a list of the same length as id_list, which contains nonnegative values that describe the weight of each task. If provided, they should be highly correlated with the computation time of each associated task to provide accurate completion time estimations.

  • iteration_period – a non-negative value that determines the wait period allowed for ray to obtain new results when next() is used. When using this instance in a for loop, it determines approximately the periodicity with which the loop body will be executed.

  • alive_bar – if not None, it should be a bar instance from the alive_progress library, while inside its with-context. If it is provided, it is called with the fraction of available tasks on each call to update_finished_tasks.

report()

Return a string that represents the current state of this progressive run.

update_finished_tasks(timeout=None)

Wait for up to timeout seconds or until ray completes computation of all pending tasks. Update the list of completed and pending tasks.

enb.parallel.fallback_get(ids, **kwargs)

Fallback get method when ray is not available.

enb.parallel.fallback_get_completed_pending_ids(ids, timeout=0)

Get two lists, one for completed and one for pending fallback ids.

enb.parallel.fallback_init()

Initialization of the fallback engine. This needs to be called before each parallelization, or the globals used in the pool might not be up to date.

enb.parallel.fallback_parallel_decorator(*decorator_args, **decorator_kwargs)

Decorator for methods intended to run in parallel in the local machine.

enb.parallel.get(ids, **kwargs)

Get results for the started ids passed as arguments.

If timeout is part of kwargs, at most that many seconds are waited. Otherwise, this is a blocking call.

enb.parallel.get_completed_pending_ids(ids, timeout=0)

Given a list of ids returned by start calls, return two lists: the first one with the input ids that are ready, and the second one with the input ids that are not.

enb.parallel.init()

If ray is present, this method initializes it. If the fallback engine is used, it is ensured that all globals are correctly shared with the pool.

enb.parallel.parallel(*args, **kwargs)

Decorator for methods intended to run in parallel.

When ray is available, the .remote() call is performed on the ray decorated function. When it is not, a fallback parallelization method is used.

To run a parallel method f, call f.start with the arguments you want to pass to f. An id object is returned immediately. The result can then be retrieved by calling enb.parallel.get with the id object.

Important: parallel calls should not generally read or modify global variables. The main exception is enb.config.options, which can be read from parallel calls.
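
A minimal usage sketch combining the pieces documented above (the decorated function and its workload are arbitrary examples):

    import enb.parallel

    @enb.parallel.parallel()
    def square(x):
        return x * x

    if __name__ == "__main__":
        enb.parallel.init()
        ids = [square.start(x=i) for i in range(10)]

        # Optionally run periodic reporting while tasks complete in the background.
        for pg in enb.parallel.ProgressiveGetter(id_list=ids, iteration_period=1):
            print(pg.report())

        results = enb.parallel.get(ids)
        print(results)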

enb.parallel.parallel_fix_dill_crash()

Temporary fix of a crash in dill when a module’s __file__ attribute is defined but is None.

enb.parallel_ray module

Tools to execute functions remotely using the ray library.

class enb.parallel_ray.HeadNode(ray_port, ray_port_count)

Bases: object

Class used to initialize and stop a ray head node.

The stop() method must be called after start(), or a ray cluster will remain active.

__init__(ray_port, ray_port_count)
get_node_ip()

Adapted from https://stackoverflow.com/a/166589/992926.

parse_cluster_config_csv(csv_path)

Read a CSV defining remote nodes and return a list with as many RemoteNode instances as data rows in the CSV.

start()

Start or restart a ray head node.

property status_str

Return a string reporting the status of the cluster

stop()

Stop the ray head node after disconnecting from all remote nodes.

class enb.parallel_ray.RemoteNode(address, ssh_port, head_node, ssh_user=None, local_ssh_file=None, cpu_limit=None, remote_mount_needed=None)

Bases: object

Represent a remote node of the cluster, with tools to connect via ssh.

__init__(address, ssh_port, head_node, ssh_user=None, local_ssh_file=None, cpu_limit=None, remote_mount_needed=None)
connect()

Connect to a remote ray head node.

disconnect()

Disconnect from a remote node.

mount_project_remotely()

Use sshfs to mount the remote project folder into the remote node.

remote_project_mount_path = '/home/miguelinux/.config/enb/remote_mount'
enb.parallel_ray.chdir_project_root()

When invoked, it changes the current working dir to the project’s root. It will be the remote mount point if the node is remote, otherwise the directory containing the invoking script.

enb.parallel_ray.fix_imports()

An environment variable is passed to the child processes so that they can import all modules that were imported after loading enb. This prevents the remote functions from failing the deserialization process due to missing definitions.

enb.parallel_ray.get(ids, **kwargs)

Call ray’s get method with the given arguments.

enb.parallel_ray.get_completed_pending_ids(ids, timeout=0)

Return the list of completed and pending ids.

enb.parallel_ray.init_ray()

Initialize the ray cluster if it wasn’t initialized before.

enb.parallel_ray.is_parallel_process()

Return True if and only if the call is made from a remote ray process, which can be running in the head node or any of the remote nodes (if any is present).

enb.parallel_ray.is_ray_enabled()

Return True if and only if ray is available and the current platform is one of the supported for ray clustering (currently only linux).

enb.parallel_ray.is_ray_initialized()

Return True if and only if ray is enabled and initialized.

enb.parallel_ray.is_remote_node()

Return True if and only if the call is performed from a remote ray process running on a node different from the head.

enb.parallel_ray.parallel_decorator(*args, **kwargs)

Wrapper of the @ray.remote decorator that automatically updates enb.config.options for remote processes, so that they always access the intended configuration.

enb.parallel_ray.stop_ray()

Stop the ray head node, if one is defined.

enb.pgm module

Module to handle PGM (P5) and PPM (P6) images

class enb.pgm.PGMCurationTable(original_base_dir, version_base_dir, csv_support_path=None)

Bases: PNGCurationTable

Given a directory tree containing PGM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'pgm'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

class enb.pgm.PGMWrapperCodec(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)

Bases: WrapperCodec

Raw images are coded into PGM before compression with the wrapper, and PGM is decoded to raw after decompression.

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)

enb.pgm.pgm_to_raw(input_path, output_path)

Read a file in PGM format and write its contents in raw format, which does not include any geometry or data type information.

enb.pgm.ppm_to_raw(input_path, output_path)

Read a file in PPM format and write its contents in raw format, which does not include any geometry or data type information.

enb.pgm.read_pgm(input_path, byteorder='>')

Return image data from a raw PGM file as numpy array. Format specification: http://netpbm.sourceforge.net/doc/pgm.html

(From answer: https://stackoverflow.com/questions/7368739/numpy-and-16-bit-pgm)

enb.pgm.read_ppm(input_path, byteorder='>')

Return image data from a raw PPM file as a numpy array. Format specification: http://netpbm.sourceforge.net/doc/pgm.html

(From answer: https://stackoverflow.com/questions/7368739/numpy-and-16-bit-pgm)

enb.pgm.write_pgm(array, bytes_per_sample, output_path, byteorder='>')

Write a 2D array indexed with [x,y] into output_path with PGM format.

enb.pgm.write_ppm(array, bytes_per_sample, output_path)

Write a 3-component 3D array indexed with [x,y,z] into output_path with PPM format.
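
A small usage sketch of the PGM helpers above (the file name and data are arbitrary examples):

    import numpy as np
    import enb.pgm

    # Arbitrary example image with 16-bit unsigned samples, indexed as [x, y].
    array = np.arange(4 * 3, dtype=np.uint16).reshape((4, 3))
    enb.pgm.write_pgm(array, bytes_per_sample=2, output_path="example.pgm")

    # Read the samples back as a numpy array (big-endian byteorder by default).
    loaded = enb.pgm.read_pgm("example.pgm")
    print(loaded.shape, loaded.dtype)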

enb.plotdata module

Utils to plot data (thinking about pyplot).

class enb.plotdata.BarData(pattern=None, vertical=True, **kwargs)

Bases: PlottableData2D

Vertical (horizontal) bars at x (y) positions of height (width) y (x).

__init__(pattern=None, vertical=True, **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

pattern = None
render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

shift_y(constant)

Add a constant to all y values.

vertical = True
class enb.plotdata.ErrorLines(x_values, y_values, err_neg_values, err_pos_values, vertical=False, alpha=0.5, color=None, marker_size=1, cap_size=2, line_width=1, **kwargs)

Bases: PlottableData2D

One or more error lines

__init__(x_values, y_values, err_neg_values, err_pos_values, vertical=False, alpha=0.5, color=None, marker_size=1, cap_size=2, line_width=1, **kwargs)
Parameters:
  • y_values (x_values,) – centers of the error lines

  • err_neg_values – list of lengths for the negative part of the error

  • err_pos_values – list of lengths for the positive part of the error

  • vertical – determines whether the error bars are vertical or horizontal

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.Histogram2D(x_edges, y_edges, matrix_values, color_map=None, colormap_label=None, vmin=None, vmax=None, no_data_color=(1, 1, 1, 0), bad_data_color='magenta', **kwargs)

Bases: PlottableData2D

Represent the result of a 2D histogram. See https://matplotlib.org/stable/gallery/color/colormap_reference.html

__init__(x_edges, y_edges, matrix_values, color_map=None, colormap_label=None, vmin=None, vmax=None, no_data_color=(1, 1, 1, 0), bad_data_color='magenta', **kwargs)
Parameters:
  • x_edges – the edges of the histogram along the x axis.

  • y_edges – the edges of the histogram along the y axis.

  • matrix_values – values of the histogram (2d array of dimensions given by the length of x_edges and y_edges).

  • no_data_color – color shown when no counts are found in a bin

  • bad_data_color – color shown when nan is found in a bin

  • vmin – minimum value considered in the histogram

  • vmax – maximum value considered in the histogram

  • kwargs – additional parameters passed to the parent class initializer

alpha = 1
aspect = 'equal'
bad_data_color = 'magenta'
color_map = 'Reds'
interpolation = 'none'
no_data_color = (1, 1, 1, 0)
origin = 'lower'
render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

show_cmap_bar = True
vmax = None
vmin = None
class enb.plotdata.HorizontalBand(x_values, y_values, pos_height_values, neg_height_values, show_bounding_lines=None, degradation_band_count=None, std_band_add_xmargin=False, **kwargs)

Bases: PlottableData2D

Colored band surrounding a line function given by the provided x, y positions, with positive (up) and negative (down) widths.

__init__(x_values, y_values, pos_height_values, neg_height_values, show_bounding_lines=None, degradation_band_count=None, std_band_add_xmargin=False, **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

alpha = 0.5
degradation_band_count = 25
render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

show_bounding_lines = False
class enb.plotdata.HorizontalLine(y_position, line_width=1, line_style='-', **kwargs)

Bases: PlottableData2D

Draw a horizontal line across the whole subplot.

__init__(y_position, line_width=1, line_style='-', **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.LineData(marker='o', marker_size=5, line_width=1.5, **kwargs)

Bases: PlottableData2D

Straight lines linking the defined x,y pairs.

__init__(marker='o', marker_size=5, line_width=1.5, **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt
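
As an illustration (not taken from enb itself) of how a plottable element can be rendered onto an explicit matplotlib axes, using the constructor and render methods documented here:

    import matplotlib.pyplot as plt
    import enb.plotdata

    line = enb.plotdata.LineData(
        x_values=[0, 1, 2, 3],
        y_values=[1.0, 0.8, 0.5, 0.1],
        x_label="Parameter", y_label="Value", label="example series")

    fig, axes = plt.subplots()
    line.render(axes=axes)              # draw the line on the given axes
    line.render_axis_labels(axes=axes)  # add the x and y labels
    fig.savefig("example_line.png")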

class enb.plotdata.LineSegment(x_values, y_values, length, vertical=True, line_width=1, **kwargs)

Bases: PlottableData2D

Render a horizontal or vertical line segment centered at a given position.

__init__(x_values, y_values, length, vertical=True, line_width=1, **kwargs)
Parameters:
  • x_values – a list with a single element with the x position of the rectangle’s center

  • y_values – a list with a single element with the y position of the rectangle’s center

  • length – length of the line

alpha = 0.5
property center

Line center (the line crosses this point).

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.PlottableData(data=None, axis_labels=None, label=None, extra_kwargs=None, alpha=None, legend_column_count=None, legend_position=None, marker=None, color=None, marker_size=None)

Bases: object

Base class for plottable data elements. Subclasses are instantiated by the analyzers to define sequences of plotting actions (with pyplot) that result in the desired figure.

__init__(data=None, axis_labels=None, label=None, extra_kwargs=None, alpha=None, legend_column_count=None, legend_position=None, marker=None, color=None, marker_size=None)
alpha = 0.75
color = None
legend_column_count = 1
legend_position = 'title'
marker = None
render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

render_axis_labels(axes=None)

Add axis labels in current figure - don’t show or write the result

Parameters:

axes – if not None, those axes are used for plotting instead of plt

render_legend(axes=None)

Tell matplotlib to render the legend based on self.legend_position. In addition to matplotlib’s legend positions, ‘title’ is available for top center outside the grid.

class enb.plotdata.PlottableData2D(x_values, y_values, x_label=None, y_label=None, label=None, extra_kwargs=None, remove_duplicates=False, alpha=None, legend_column_count=None, marker=None, color=None, marker_size=None)

Bases: PlottableData

Base class for 2D plottable data.

__init__(x_values, y_values, x_label=None, y_label=None, label=None, extra_kwargs=None, remove_duplicates=False, alpha=None, legend_column_count=None, marker=None, color=None, marker_size=None)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

diff(other, ylabel_suffix='_difference')

Calculate the difference with another PlottableData2D instance. The x values are maintained and other’s y values are subtracted from self’s.

render_axis_labels(axes=None)

Show the labels in label list (if not None) or in self.axis_label_list (if label_list is None) in the current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

shift_x(constant)

Add a constant to all x values.

shift_y(constant)

Add a constant to all y values.

class enb.plotdata.Rectangle(x_values, y_values, width, height, angle_degrees=0, fill=False, line_width=1, **kwargs)

Bases: PlottableData2D

Render a rectangle (line only) in a given position.

__init__(x_values, y_values, width, height, angle_degrees=0, fill=False, line_width=1, **kwargs)
Parameters:
  • x_values – a list with a single element with the x position of the rectangle’s center

  • y_values – a list with a single element with the y position of the rectangle’s center

  • width – width of the rectangle

  • height – height of the rectangle

alpha = 0.5
render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.ScatterData(marker='o', alpha=0.5, marker_size=3, **kwargs)

Bases: PlottableData2D

Individual markers at the specified x,y positions.

__init__(marker='o', alpha=0.5, marker_size=3, **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.StepData(x_values, y_values, x_label=None, y_label=None, label=None, extra_kwargs=None, remove_duplicates=False, alpha=None, legend_column_count=None, marker=None, color=None, marker_size=None)

Bases: PlottableData2D

Horizontal segments at the defined x,y positions.

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

where = 'post'
class enb.plotdata.Table(x_values, y_values, cell_text, x_label=None, y_label=None, cell_alignment='center', col_header_aligment='center', row_header_alignment='left', edges='open', highlight_best_row=None, highlight_best_column=None)

Bases: PlottableData2D

Display a 2D table of data.

__init__(x_values, y_values, cell_text, x_label=None, y_label=None, cell_alignment='center', col_header_aligment='center', row_header_alignment='left', edges='open', highlight_best_row=None, highlight_best_column=None)
Parameters:
  • x_values – list of column headers

  • y_values – list of row headers

  • cell_text – list of lists containing the cell text to display

  • cell_alignment – text alignment for the data cells

  • col_header_alignment – text alignment for the column headers

  • row_header_alignment – text alignment for the row headers

  • edges – argument passed to plt.table (substring of ‘BRTL’ or {‘open’, ‘closed’, ‘horizontal’, ‘vertical’})

  • highlight_best_row – if not None, it must be either “low” or “high”. In those cases, the best (lowest or highest) value of each column is highlighted.

  • highlight_best_column – if not None, it must be either “low” or “high”. In those cases, the best (lowest or highest) value of each row is highlighted.

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

class enb.plotdata.VerticalLine(x_position, line_width=1, line_style='-', **kwargs)

Bases: PlottableData2D

Draw a vertical line across the whole subplot.

__init__(x_position, line_width=1, line_style='-', **kwargs)
Parameters:
  • y_values (x_values,) – values to be plotted (only a reference is kept)

  • y_label (x_label,) – axis labels

  • label – line legend label

  • extra_kwargs – extra arguments to be passed to plt.plot

render(axes=None)

Render data in current figure.

Parameters:

axes – if not None, those axes are used for plotting instead of plt

enb.plotdata.get_available_styles()

Get a list of all styles available for plotting. It includes installed matplotlib styles plus custom styles

enb.plotdata.get_local_styles()

Get the list of basenames of all styles in the enb/config/mpl_styles folder.

enb.plotdata.get_matlab_styles()

Return the list of installed matlab styles.

enb.png module

PNG manipulation (e.g., curation) tools.

class enb.png.PDFToPNG(input_pdf_dir, output_png_dir, csv_support_path=None)

Bases: FileVersionTable

Take all .pdf files in input dir and save them as .png files into output_dir, maintaining the relative folder structure.

__init__(input_pdf_dir, output_png_dir, csv_support_path=None)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • version_name – arbitrary name of this file version

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, enb.config.options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, a FilePropertiesTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

  • check_generated_files – if True, the table checks that each call to version() produces a file at output_path. Set to False to allow arbitrarily named output files.

column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'pdf'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

version(input_path, output_path, row)

Create a version of input_path and write it into output_path.

Parameters:
  • input_path – path to the file to be versioned

  • output_path – path where the version should be saved

  • row – metainformation available using super().get_df for input_path

Returns:

if not None, the time in seconds it took to perform the (forward) versioning.
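
Minimal usage sketch (the directory names are hypothetical and must exist before get_df() is called):

    from enb import png

    pdf2png = png.PDFToPNG(input_pdf_dir="./plots_pdf",
                           output_png_dir="./plots_png",
                           csv_support_path="./persistence_pdf_to_png.csv")
    # Versions every .pdf found under input_pdf_dir and returns a DataFrame
    # with one row per converted file.
    df = pdf2png.get_df()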

class enb.png.PNGCurationTable(original_base_dir, version_base_dir, csv_support_path=None)

Bases: FileVersionTable

Given a directory tree containing PNG images, copy those images into a new directory tree in raw BSQ format, adding to the output names the geometry information tags recognized by enb.isets.load_array_bsq.

__init__(original_base_dir, version_base_dir, csv_support_path=None)
Parameters:
  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, options.base_dataset_dir is used

  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'png'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

version(input_path, output_path, row)

Transform PNG files into raw images with name tags recognized by isets.
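
Illustrative sketch with hypothetical directory names; get_df() performs the actual conversion:

    from enb import png

    curation = png.PNGCurationTable(
        original_base_dir="./datasets/png_images",
        version_base_dir="./datasets/raw_bsq",
        csv_support_path="./persistence_png_curation.csv")
    curation.get_df()   # writes the raw (BSQ) versions with geometry tags in their names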

class enb.png.PNGWrapperCodec(compressor_path, decompressor_path, param_dict=None, output_invocation_dir=None, signature_in_name=False)

Bases: WrapperCodec

Raw images are coded into PNG before compression with the wrapper, and PNG is decoded to raw after decompression.

compress(original_path: str, compressed_path: str, original_file_info=None)

Compress original_path into compressed_path using param_dict as params.

Parameters:
  • original_path – path to the original file to be compressed

  • compressed_path – path to the compressed file to be created

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None

Returns:

(optional) a CompressionResults instance, or None (see compression_results_from_paths)

decompress(compressed_path, reconstructed_path, original_file_info=None)

Decompress compressed_path into reconstructed_path using param_dict as params (if needed).

Parameters:
  • compressed_path – path to the input compressed file

  • reconstructed_path – path to the output reconstructed file

  • original_file_info – a dict-like object describing original_path’s properties (e.g., geometry), or None. Should only be actually used in special cases, since codecs are expected to store all needed metainformation in the compressed file.

Returns:

(optional) a DecompressionResults instance, or None (see decompression_results_from_paths)
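
Hedged instantiation sketch; the encoder/decoder paths and parameters below are hypothetical placeholders for an external tool that consumes and produces PNG:

    from enb import png

    codec = png.PNGWrapperCodec(
        compressor_path="./bin/my_png_encoder",      # hypothetical external binary
        decompressor_path="./bin/my_png_decoder",    # hypothetical external binary
        param_dict=dict(quality=5))
    # Such codecs are typically evaluated by adding them to the codec list of an
    # enb compression experiment rather than calling compress()/decompress() directly.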

enb.png.pdf_to_png(input_dir, output_dir)

Take all .pdf files in input_dir and save them as .png files into output_dir, maintaining the relative folder structure.

It is perfectly valid for input_dir and output_dir to point to the same location, but input_dir must exist beforehand.

enb.png.raw_path_to_png(raw_path, png_path, image_properties_row=None)

Render a uint8 or uint16 raw image with 1, 3 or 4 components.

Parameters:
  • raw_path – path to the image in raw format to render in png.

  • png_path – path where the png file is to be stored.

  • image_properties_row – if raw_path does not contain geometry information, this parameter should be a dict-like object that indicates width, height, number of components, bytes per sample, signedness and endianness if applicable.

enb.png.render_array_png(img, png_path)

Render a uint8 or uint16 image with 1, 3 or 4 components.

Parameters:
  • img – image array indexed by [x,y,z]

  • png_path – path where the png file is to be stored
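
Hedged sketch of raw-to-PNG rendering. The geometry keys shown below are assumptions based on enb.isets conventions; when the raw file name already carries a geometry tag (e.g., "-u8be-3x512x512"), the dictionary can usually be omitted:

    from enb import png

    # Hypothetical paths and geometry values.
    geometry = dict(width=512, height=512, component_count=3,
                    bytes_per_sample=1, signed=False, big_endian=True)
    png.raw_path_to_png("example-u8be-3x512x512.raw", "example.png",
                        image_properties_row=geometry)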

enb.progress module

Tools to display live progress of ATable instances based on the rich library.

class enb.progress.ProgressTracker(atable, row_count: int, chunk_size: int)

Bases: Live

Keep track of the progress of an ATable’s (incl. Experiments’) get_df.

__init__(atable, row_count: int, chunk_size: int)
Parameters:
  • atable – ATable subclass instance for which the progress is to be tracked

  • row_count – total number of rows that need to be computed

  • chunk_size – chunk size (any non-positive number is also interpreted as a chunk size equal to row_count)

property chunk_count

Get the number of chunks defined for this progress tracking stage.

complete_chunk()

Add 1 to the number of completed chunks if a chunk task has been defined.

class property console

Return the console instance for the current instance (the only live instance) of ProgressTracker, or None if none are available.

style_bar_complete = '#9b5ccb bold'
style_bar_finished = '#f3ac05'
style_bar_incomplete = '#252525'
style_border = '#adadad bold'
style_spinner = '#9b5ccb bold'
style_text_completed = '#bcbcbc bold'
style_text_label = '#787878'
style_text_percentage = '#bcbcbc'
style_text_separator = '#505050'
style_text_speed = '#bcbcbc'
style_text_time = '#bcbcbc'
style_text_total = '#bcbcbc'
style_text_unit = '#707070'
style_title_analyzer = '#1990ff'
style_title_atable = '#9b59ff'
style_title_dataset = '#45e193'
style_title_experiment = '#13bf00'
style_title_other = '#cdabff'
style_title_summary = '#23cfff'
update_chunk_completed_rows(chunk_completed_rows)

Set the number of rows completed for the current chunk.

enb.progress.is_progress_enabled()

Return True if and only if all conditions for displaying progress are met:

  • verbose is level 1 (verbose) or 2 (info)

  • disable_progress_bar is set to False

enb.render module

Render plots using matplotlib and enb.plotdata.PlottableData instances.

enb.render.parallel_render_plds_by_group(pds_by_group_name, output_plot_path, column_properties, global_x_label, global_y_label, combine_groups=False, color_by_group_name=None, group_name_order=None, fig_width=None, fig_height=None, legend_column_count=None, force_monochrome_group=True, show_grid=None, show_subgrid=None, grid_alpha=0.6, subgrid_alpha=0.4, semilog_y=None, semilog_y_base=10, semilog_y_min_bound=1e-10, group_row_margin=None, x_min=None, x_max=None, y_min=None, y_max=None, horizontal_margin=None, vertical_margin=None, global_y_label_margin=None, y_labels_by_group_name=None, tick_direction='in', x_tick_list=None, x_tick_label_list=None, x_tick_label_angle=0, y_tick_list=None, y_tick_label_list=None, left_y_label=False, extra_plds=(), plot_title=None, title_y=None, show_legend=True, legend_position=None, style_list=())

Ray wrapper for render_plds_by_group. See that method for parameter information.

enb.render.render_plds_by_group(pds_by_group_name, output_plot_path, column_properties, global_x_label, global_y_label, combine_groups=False, color_by_group_name=None, group_name_order=None, fig_width=None, fig_height=None, legend_column_count=None, force_monochrome_group=True, show_grid=None, show_subgrid=None, grid_alpha=0.6, subgrid_alpha=0.5, semilog_y=None, semilog_y_base=10, semilog_y_min_bound=1e-10, group_row_margin=None, x_min=None, x_max=None, horizontal_margin=None, vertical_margin=None, y_min=None, y_max=None, global_y_label_margin=None, y_labels_by_group_name=None, tick_direction='in', x_tick_list=None, x_tick_label_list=None, x_tick_label_angle=0, y_tick_list=None, y_tick_label_list=None, left_y_label=False, extra_plds=(), plot_title=None, title_y=None, show_legend=True, legend_position=None, style_list=('default',))

Render lists of plotdata.PlottableData instances indexed by group name. Each group is rendered in a row (subplot), with a shared X axis. Groups can also be combined into a single row (subplot), i.e., rendering all plottable data into that single subplot.

When applicable, None values are substituted by default values given in enb.config.options (guaranteed to be updated thanks to the @enb.parallel.parallel decorator) and the current context.

Mandatory parameters:

Parameters:
  • pds_by_group_name – dictionary of lists of PlottableData instances

  • output_plot_path – path to the file to be created with the plot

  • column_properties – ColumnProperties instance for the column being plotted

  • global_x_label – x-axis label shared by all subplots (there can be just one subplot)

  • global_y_label – y-axis label shared by all subplots (there can be just one subplot)

General figure configuration. If None, most of these values are retrieved from the [enb.aanalysis.Analyzer] section of *.ini files.

Parameters:
  • combine_groups – if False, each group is plotted in a different row. If True, all groups share the same subplot (and no group name is displayed).

  • color_by_group_name – if not None, a dictionary of pyplot colors for the groups, indexed with the same keys as pds_by_group_name.

  • group_name_order – if not None, it contains the order in which groups are displayed. If None, alphabetical, case-insensitive order is applied.

  • fig_width – figure width. The larger the figure size, the smaller the text will look.

  • fig_height – figure height. The larger the figure size, the smaller the text will look.

  • legend_column_count – when the legend is shown, use this many columns.

  • force_monochrome_group – if True, all plottable data with non-None color in each group is set to the same color, defined by color_cycle.

Axis configuration:

Parameters:
  • show_grid – if True, or if None and options.show_grid, grid is displayed aligned with the major axes.

  • show_subgrid – if True, or if None and options.show_subgrid, grid is displayed aligned with the minor axes.

  • grid_alpha – transparency (between 0 and 1) of the main grid, if displayed.

  • subgrid_alpha – transparency (between 0 and 1) of the subgrid, if displayed.

  • semilog_y – if True, a logarithmic scale is used in the Y axis.

  • semilog_y_base – if semilog_y is True, the logarithm base employed.

  • semilog_y_min_bound – if semilog_y is True, make y_min the maximum of y_min and this value.

  • group_row_margin – if provided, this margin is applied between rows of groups

Axis limits:

Parameters:
  • x_min – if not None, force plots to have this value as left end.

  • x_max – if not None, force plots to have this value as right end.

  • horizontal_margin – Horizontal margin to be added to the figures, expressed as a fraction of the horizontal dynamic range.

  • vertical_margin – Vertical margin to be added to the figures, expressed as a fraction of the vertical dynamic range.

  • y_min – if not None, force plots to have this value as bottom end.

  • y_max – if not None, force plots to have this value as top end.

  • global_y_label_margin – if not None, the padding to be applied between the global_y_label and the y axis (if such label is enabled).

Optional axis labeling:

Parameters:
  • y_labels_by_group_name – if not None, a dictionary of labels for the groups, indexed with the same keys as pds_by_group_name.

  • tick_direction – direction of the ticks in the plot. Can be “in”, “out” or “inout”.

  • x_tick_list – if not None, these ticks will be displayed in the x axis.

  • x_tick_label_list – if not None, these labels will be displayed in the x axis. Only used when x_tick_list is not None.

  • x_tick_label_angle – when label ticks are specified, they will be rotated to this angle

  • y_tick_list – if not None, these ticks will be displayed in the y axis.

  • y_tick_label_list – if not None, these labels will be displayed in the y axis. Only used when y_tick_list is not None.

  • left_y_label – if True, the group label is shown to the left instead of to the right

Additional plottable data:

Parameters:

extra_plds – an iterable of additional PlottableData instances to be rendered in all subplots.

Global title:

Parameters:
  • plot_title – title to be displayed.

  • title_y – y position of the title, when displayed. An attempt is made to automatically situate it without overlapping with the axes or the legend.

  • show_legend – if True, legends are added to the plot when one or more PlottableData instances contain a label

  • legend_position – position of the legend (if shown). It can be “title” to display it above the plot, or any value accepted by the loc argument of matplotlib’s legend().

Matplotlib styles:

Parameters:

style_list – list of valid style arguments recognized by matplotlib.style.use. Each element can be any of matplotlib’s default styles or a path to a valid matplotlibrc file. Styles are applied from left to right, overwriting definitions without warning. By default, matplotlib’s “default” style is applied.
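
Minimal sketch of a direct call (analyzers normally invoke this on your behalf); the data, labels and output path are illustrative, and passing an explicit ColumnProperties instance is an assumption on the caller's side:

    from enb import atable, plotdata, render

    pds_by_group = {
        "group A": [plotdata.ScatterData(x_values=[0, 1, 2], y_values=[1.0, 2.5, 2.0])],
        "group B": [plotdata.ScatterData(x_values=[0, 1, 2], y_values=[0.5, 1.5, 3.0])],
    }
    render.render_plds_by_group(
        pds_by_group_name=pds_by_group,
        output_plot_path="./plots/example_by_group.png",
        column_properties=atable.ColumnProperties(name="value"),
        global_x_label="x",
        global_y_label="value",
        combine_groups=False)   # one subplot (row) per group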

enb.sets module

Locate, analyze, expose and catalogue dataset entries.

The FilePropertiesTable class contains the minimal information about the file as well as basic statistical measurements.

Subclasses of this table can be created adding extra columns.

The experiment.CompressionExperiment class takes an instance of FilePropertiesTable to know what files the experiment should be run on.

class enb.sets.FilePropertiesTable(csv_support_path=None, base_dir=None)

Bases: ATable

Table describing basic file properties (see decorated methods below).

__init__(csv_support_path=None, base_dir=None)
Parameters:
  • index – string with column name or list of column names that will be used for indexing. Indices provided to self.get_df must be either one instance (when a single column name is given) or a list of as many instances as elements are contained in self.index. See `self.indices.

  • csv_support_path – path to a file where this ATable contents are to be stored and retrieved. If None, persistence is disabled.

  • column_to_properties – if not None, it is a mapping from strings to callables that defines the columns of the table and how to obtain the cell values

  • progress_report_period – if not None, it must be a positive number of seconds that are waited between progress report messages (if applicable).

base_dir = None
column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FilePropertiesTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

dataset_files_extension = 'raw'

Default input sample extension. It affects the result of enb.atable.get_all_test_files.

get_df(target_indices=None, target_columns=None, fill=True, overwrite=None, chunk_size=None)

Core method for all enb.atable.ATable subclasses to obtain the table’s content. The following principles guide the way get_df works:

  • This method returns a pandas.DataFrame containing one row per element in target_indices, and as many columns as are defined in self.column_to_properties. If target_indices is None, all files in enb.config.options.base_dataset_dir are used (after filtering by self.dataset_files_extension) by default.

  • Any persistence data already present is loaded, and only new indices or columns are added. This way, each column-setting function needs to be called only once per index for any given enb.atable.ATable subclass.

  • Results are returned only for target_indices, even if you previously computed other rows. Thus, only not-already-present indices and new columns require actual computation. Any new result produced by this call is appended to the already existing persistence data.

  • Rows computed in a previous call to this get_df are not deleted from persistent data storage, even if target_indices contains fewer or different indices than in previous calls.

  • Beware that if you remove a column definition from this enb.atable.ATable subclass and run get_df, that column will be removed from persistent storage. If you add a new column, its values will be computed for all rows in target_indices.

  • You can safely select new and/or different target_indices. New data are stored, and existent rows are not removed. If you add new column definitions, those are computed for target_indices only. If there are other previously existing rows, they are flagged as incomplete, and those new columns will be computed only when those rows’ indices are included in target_indices.

Recall that table cell values are restricted to be numeric, string, boolean or non-scalar, i.e., list, tuple or dict.

Parameters:
  • target_indices – list of indices that are to be contained in the table, or None to infer automatically from the dataset.

  • target_columns – if not None, it must be a list of column names (defined for this class) that are to be obtained for the specified indices. If None, all columns are used.

  • fill – If True or False, it determines whether values are computed for the selected indices. If None, values are only computed if enb.config.options.no_new_results is False.

  • overwrite – if True, values selected for filling are computed even if they are already present in permanent storage. Otherwise, existing values are skipped from the computation.

  • chunk_size – If None, its value is assigned from options.chunk_size. After this, if not None, the list of target indices is split in chunks of size at most chunk_size elements (each one corresponding to one row in the table). Results are made persistent every time one of these chunks is completed. Setting chunk_size to -1 is functionally identical to setting it to None (or to the number of target indices), but it does not display “Starting chunk 1/1…” (useful if the chunk partitioning is performed outside, i.e., by an Experiment class).

  • progress_tracker – if not None, the enb.progress.ProgressTracker instance being used to keep track of an ATable instance at an upper level.

Returns:

a DataFrame instance containing the requested data

Raises:

CorruptedTableError, ColumnFailedError, when an error is encountered processing the data.

get_relative_path(file_path)

Get the relative path. Overwritten to handle the versioned path.

hash_field_name = 'sha256'
index_name = 'file_path'
set_corpus(file_path, row)

Store the corpus name of a data sample. By default, it is the name of the folder in which the sample is stored.

Symbolic links can be used within the base dataset dir (./datasets by default). In that case, they are treated as a regular file with the path relative to the project.

set_file_size(file_path, row)

Store the original file size in row.

Parameters:
  • file_path – path to the file to analyze

  • row – dictionary of previously computed values for this file_path (to speed up derived values)

set_hash_digest(file_path, row)

Store the hexdigest of file_path’s contents, using hash_algorithm as configured.

Parameters:
  • file_path – path to the file to analyze

  • row – dictionary of previously computed values for this file_path (to speed up derived values)

version_name = 'original'
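
Usage sketch with a hypothetical dataset folder; passing an explicit list of indices is one option, while target_indices=None would use the default dataset dir (filtered by the .raw extension):

    import glob
    from enb import sets

    table = sets.FilePropertiesTable(
        csv_support_path="./persistence_file_properties.csv")
    # Hypothetical paths gathered from a local folder of .raw samples.
    df = table.get_df(
        target_indices=glob.glob("./datasets/**/*.raw", recursive=True))
    print(df[["corpus", "size_bytes"]].head())
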
class enb.sets.FileVersionTable(version_base_dir, version_name='', original_properties_table=None, original_base_dir=None, csv_support_path=None, check_generated_files=True)

Bases: FilePropertiesTable

Table with the purpose of converting an input dataset into a destination folder. This is accomplished by calling the version() method for all input files. Subclasses may be defined so that they inherit from other classes and can apply more complex versioning.

__init__(version_base_dir, version_name='', original_properties_table=None, original_base_dir=None, csv_support_path=None, check_generated_files=True)
Parameters:
  • version_base_dir – path to the versioned base directory (versioned directories preserve names and structure within the base dir)

  • version_name – arbitrary name of this file version

  • original_base_dir – path to the original directory (it must contain all indices requested later with self.get_df()). If None, enb.config.options.base_dataset_dir is used

  • original_properties_table – instance of the file properties subclass to be used when reading the original data to be versioned. If None, a FilePropertiesTable is instantiated automatically.

  • csv_support_path – path to the file where results (of the versioned data) are to be long-term stored. If None, one is assigned by default based on options.persistence_dir.

  • check_generated_files – if True, the table checks that each call to version() produces a file at output_path. Set to False to allow arbitrarily named output files.

column_to_properties = {'corpus': ColumnProperties('name'='corpus', 'fun'=<function FileVersionTable.set_corpus>, 'label'='Corpus name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'original_file_path': ColumnProperties('name'='original_file_path', 'fun'=<function FileVersionTable.set_original_file_path>, 'label'='Original file path', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'sha256': ColumnProperties('name'='sha256', 'fun'=<function FilePropertiesTable.set_hash_digest>, 'label'='sha256 hex digest', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'size_bytes': ColumnProperties('name'='size_bytes', 'fun'=<function FilePropertiesTable.set_file_size>, 'label'='File size (bytes)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_name': ColumnProperties('name'='version_name', 'fun'=<function FileVersionTable.column_version_name>, 'label'='Version name', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False), 'version_time': ColumnProperties('name'='version_time', 'fun'=<function FileVersionTable.set_version_time>, 'label'='Versioning time (s)', 'semilog_x'=False, 'semilog_y'=False, 'semilog_x_base'=10, 'semilog_y_base'=10, 'has_dict_values'=False, 'has_iterable_values'=False, 'has_object_values'=False)}

The column_properties attribute keeps track of what columns have been defined, and the methods that need to be called to compute them. The keys of this attribute can be used to determine the columns defined in a given class or instance. The values are enb.atable.ColumnProperties instances, which can be set manually after definition and before calling enb.aanalysis.Analyzer subclasses’ get_df.

column_version_name(file_path, row)

Automatically add the version name as a column

get_default_target_indices()

Get the list of samples in self.original_base_dir and its subdirs that have extension self.dataset_files_extension.

get_df(target_indices=None, fill=True, overwrite=None, target_columns=None)

Create a version of target_indices (which must all be contained in self.original_base_dir) into self.version_base_dir. Then return a pandas DataFrame containing all given indices and defined columns. If fill is True, missing values will be computed. If fill and overwrite are True, all values will be computed, regardless of whether they are already present in the table.

Parameters:
  • overwrite – if True, version files are written even if they exist

  • target_indices – list of indices that are to be contained in the table, or None to use the list of files returned by enb.atable.get_all_test_files()

  • target_columns – if not None, the list of columns that are considered for computation

original_to_versioned_path(original_path)

Get the path of the versioned file corresponding to original_path. This function will replicate the folder structure within self.original_base_dir.

set_corpus(file_path, row)

Store the corpus name of a data sample. By default, it is the name of the folder in which the sample is stored.

Symbolic links can be used within the base dataset dir (./datasets by default). In that case, they are treated as a regular file with the path relative to the project.

set_original_file_path(file_path, row)

Store the path of the original file being versioned.

set_version_time(file_path, row)

Run self.version() and store the wall version time.

version(input_path, output_path, row)

Create a version of input_path and write it into output_path.

Parameters:
  • input_path – path to the file to be versioned

  • output_path – path where the version should be saved

  • row – metainformation available using super().get_df for input_path

Returns:

if not None, the time in seconds it took to perform the (forward) versioning.
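
Illustrative sketch of a subclass: only version() needs to be defined. This hypothetical example simply copies each input file unchanged, and the directory names are made up:

    import os
    import shutil
    from enb import sets

    class CopyVersionTable(sets.FileVersionTable):
        """Toy versioning: copy each input file without modification."""
        dataset_files_extension = "raw"   # assumption: inputs are .raw samples

        def version(self, input_path, output_path, row):
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            shutil.copy(input_path, output_path)

    # The folder structure of original_base_dir is replicated under version_base_dir.
    vt = CopyVersionTable(version_base_dir="./datasets_versioned",
                          original_base_dir="./datasets",
                          version_name="copy")
    vt.get_df()   # creates the versioned tree and returns its property table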

enb.sets.parallel_version_one_path(version_fun, input_path, output_path, overwrite, original_info_df, check_generated_files)

Run the versioning of one path.

enb.sets.version_one_path_local(version_fun, input_path, output_path, overwrite, original_info_df, check_generated_files)

Version input_path into output_path using version_fun.

Parameters:
  • version_fun – function with signature like FileVersionTable.version

  • input_path – path of the file to be versioned

  • output_path – path where the versioned file is to be stored

  • overwrite – if True, the version is calculated even if output_path already exists

  • original_info_df – DataFrame produced by a FilePropertiesTable instance that contains an entry for atable.indices_to_internal_loc().

  • check_generated_files – flag indicating whether failing to produce output_path must raise an exception.

Returns:

a tuple (output_path, l), where output_path is the selected output path and l is a list with the obtained versioning times. The list l shall contain options.repetitions elements. NOTE: if the subclass version method returns a value, that value is taken as the time measurement.

enb.tarlite module

Lite archiving format to write several files into a single one.

class enb.tarlite.TarliteReader(tarlite_path)

Bases: object

Extract files created by TarliteWriter.

__init__(tarlite_path)
extract_all(output_dir_path)

Extract all files to output_dir_path.

class enb.tarlite.TarliteWriter(initial_input_paths=None)

Bases: object

Input a series of file paths and output a single file with all the inputs’ contents, plus some metainformation to reconstruct them. Files are stored flatly, i.e., only names are stored, discarding any information about their containing dirs.

__init__(initial_input_paths=None)
add_file(input_path)

Add a file path to the list of pending ones. Note that files are not read until the write() method is invoked.

write(output_path)

Save the current list of input paths into output_path.

enb.tarlite.tarlite_files(input_paths, output_tarlite_path)

Take a list of input paths and combine them into a single tarlite file.

enb.tarlite.untarlite_files(input_tarlite_path, output_dir_path)

Take a tarlite file and output the contents into the given directory.
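
Round-trip sketch with hypothetical file names:

    from enb import tarlite

    # Convenience functions:
    tarlite.tarlite_files(["report.txt", "results.csv"], "bundle.tarlite")
    tarlite.untarlite_files("bundle.tarlite", "./extracted")

    # Equivalent, using the writer/reader classes directly:
    writer = tarlite.TarliteWriter(initial_input_paths=["report.txt"])
    writer.add_file("results.csv")          # files are read only when write() is called
    writer.write("bundle2.tarlite")
    tarlite.TarliteReader("bundle2.tarlite").extract_all("./extracted2")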

enb.tcall module

Timed calls to subprocess, so that real execution times can be obtained.

exception enb.tcall.InvocationError

Bases: Exception

Raised when an invocation fails.

enb.tcall.get_status_output_time(invocation, expected_status_value=0, wall=None, timeout=None)

Run invocation, and return its status, output, and total (wall or user+system) time in seconds.

Parameters:
  • expected_status_value – if not None, status must be equal to this value or an InvocationError is raised.

  • wall – if True, execution wall time is returned; if False, user+system CPU time is returned (both in seconds). If None, the value of enb.config.options.report_wall_time is used.

  • timeout – if not None and not 0, an exception is raised if the execution exceeds this value

Returns:

status, output, time

enb.tcall.get_status_output_time_memory(invocation, expected_status_value=0, wall=None, timeout=None)

Run invocation, and return its status, output, and total (wall or user+system) time in seconds.

Parameters:
  • expected_status_value – if not None, status must be equal to this value or an InvocationError is raised.

  • wall – if True, execution wall time is returned; if False, user+system CPU time is returned (both in seconds). If None, the value of enb.config.options.report_wall_time is used.

  • timeout – if not None and not 0, an exception is raised if the execution exceeds this value

Returns:

status, output, time, used_memory_kb
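
Short sketch; the invocation string is an example, and a status different from expected_status_value would raise InvocationError:

    from enb import tcall

    status, output, seconds = tcall.get_status_output_time(
        invocation="ls -l", expected_status_value=0, wall=True)
    print(f"status={status}, wall time={seconds:.3f} s")

    # The memory-reporting variant additionally returns used_memory_kb:
    status, output, seconds, used_memory_kb = tcall.get_status_output_time_memory("ls -l")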

Module contents

Experiment notebook (enb) library.

Please see https://github.com/miguelinux314/experiment-notebook for further information.