Remote Data

Remote Data#

Overview#

Paidiverpy supports remote data sources for input_path, output_path, and metadata_path in the configuration file. This allows you to work with data stored in cloud storage services, such as Amazon S3, without needing to download it locally. The package can handle both public and private data sources, enabling seamless integration with various data storage solutions. This section explains how to configure and use remote data sources in your pipeline.

Configuring Remote Data#

To work with remote data, specify the data source path in the configuration file. The path can be a public or private URL. If the data source is private, credentials may be required.

### Configuration Parameters

The configuration file supports the following parameters:

  • input_path: Location of input data.

  • output_path: Destination for processed data.

  • metadata_path: Path to the metadata file.

### Public vs. Private Data Sources

#### Public URLs If the remote location has a public URL (e.g., starting with https://), no credentials are needed.

Example:

general:
  input_path: 'https://paidiver-o.s3-ext.jc.rl.ac.uk/paidiverpy/data/lazy_load_benthic/'
  output_path: 's3://bucket-name/output/data/path/'
  metadata_path: "https://paidiver-o.s3-ext.jc.rl.ac.uk/paidiverpy/data/lazy_load_benthic/metadata_ifdo_hf.json"
  metadata_type: 'IFDO'
  image_open_args: 'JPG'

steps:
  # Define pipeline steps

In this case, you can read data and metadata without authentication. However, credentials are required if saving output to a private S3 bucket (if you set output_path to a private S3 location).

#### Private URLs For private remote locations (e.g., starting with s3://), you must provide credentials as environment variables:

  • OS_SECRET

  • OS_TOKEN

  • OS_ENDPOINT

Failure to provide credentials when accessing private data will result in an error.

Example:

general:
  input_path: 's3://bucket-name/input/data/path/'
  output_path: 's3://bucket-name/output/data/path/'
  metadata_path: 's3://bucket-name/metadata/path/metadata.json'
  metadata_type: 'IFDO'
  image_open_args: 'JPG'

steps:
  # Define pipeline steps

In this case, ensure the necessary environment variables are set before running the pipeline.

### Saving Output Data At the end of the pipeline, if output saving is enabled, the processed images will be stored in the specified S3 output directory.

Note

For details on the configuration file format, refer to the Configuration File section.