Dataset Model

Field	Description	Settable
`dataset_name`	Human-readable name for the dataset	create, update
`project_id`	Project this dataset belongs to	create, update
`measurement`	Industry-standard experiment type (e.g. `"Raman Spectroscopy"`)	create, update
`data_type`	Institution-specific data organization descriptor (e.g. `"ScopeFoundry H5 file"`)	create, update
`instrument_name`	Name of the instrument as registered in Crucible	create, update
`data_format`	File type or extension (e.g. `"h5"`, `"dat"`)	create, update
`session_name`	Optional tag grouping datasets collected in the same session	create, update
`timestamp`	When the data was collected (ISO 8601 format)	create, update
`public`	Whether the dataset is publicly accessible (default: `False`)	create, update
`owner_orcid`	Dataset owner — defaults to the authenticated user	create, update
`unique_id`	System-assigned MFID identifier	server-assigned
`size`	Total file size in bytes	server-assigned
`creation_time`	When the record was created	server-assigned
`modification_time`	When the record was last modified	server-assigned
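
The `timestamp` field expects an ISO 8601 string. A minimal sketch of producing one with Python's standard library follows. It assumes a timezone-aware UTC value is acceptable; the table above does not say whether a UTC offset is required, and the date used here is purely illustrative.

```python
from datetime import datetime, timezone

# Build a timezone-aware datetime and serialize it as ISO 8601.
collected_at = datetime(2024, 5, 17, 14, 30, 0, tzinfo=timezone.utc)
timestamp = collected_at.isoformat()

print(timestamp)  # 2024-05-17T14:30:00+00:00
```

The resulting string could then be passed as the `timestamp` value when building the dataset.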

Relationships

Relationship	Key(s)	Description
Files	`files_to_upload` in `create()`; `add_file_to_dataset(dsid, file_path)` to add later	Zero or more files can be attached to a dataset. Each file is uploaded to cloud storage and triggers an ingestion process to parse metadata and generate thumbnails.
Scientific metadata	`scientific_metadata` in `create()`; `metadata` in `update_scientific_metadata()` / `replace_scientific_metadata()`	A free-form JSON object for experiment-specific parameters. Stored separately from structured fields and searchable across datasets.
Thumbnails	`add_thumbnail(dsid, image)`	Small preview images representing the data or results. Generated automatically by ingestors where supported, or uploaded manually.
Samples	`sample_id` in `add_sample(dataset_id, sample_id)`	A dataset can be linked to one or more samples, and a sample to one or more datasets — capturing which material was measured.
Parent/child datasets	`parent_dataset_id`, `child_dataset_id` in `link_parent_child()`	Datasets can be linked in a directed hierarchy to represent processing pipelines (e.g. raw → calibrated → analyzed).

Working with Datasets

Creating a dataset

Pass a Dataset model and optional files to client.datasets.create():

from crucible.models import Dataset

result = client.datasets.create(
    dataset=Dataset(
        dataset_name="XRD run 5",
        measurement="X-ray diffraction",
        instrument_name="Beamline 12.3.2",
        project_id="my-project",
    ),
    files_to_upload=["xrd_run5.xy"],
    scientific_metadata={"wavelength_angstrom": 0.7749, "temperature_K": 300},
    keywords=["XRD", "powder"],
)

dsid = result["dsid"]

You can upload multiple files in one call:

result = client.datasets.create(
    dataset=Dataset(dataset_name="Multi-file dataset", project_id="my-project"),
    files_to_upload=["file1.dat", "file2.dat", "thumbnail.png"],
)

What happens when you call create()

create() is a client-side convenience method that chains several API calls:

POST /datasets — creates the dataset record and returns a dsid
POST /resources/{dsid}/metadata — adds scientific metadata (if provided)
POST /datasets/{dsid}/keywords — adds each keyword individually (if provided)
Uploads each file via GCS and triggers an ingestion request per file (if files_to_upload is provided)
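
The sequence above can be made concrete with a small sketch that lists, in order, the steps a given create() call would involve. This is an illustration only, not the client's implementation: `plan_create_calls` is a hypothetical helper that builds strings and sends no requests, and the `dsid` is passed in as a placeholder because the real one is only known after the first request returns.

```python
def plan_create_calls(dsid, scientific_metadata=None, keywords=None, files_to_upload=None):
    """Return the ordered steps create() is described as performing.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    steps = ["POST /datasets"]
    if scientific_metadata:
        steps.append(f"POST /resources/{dsid}/metadata")
    # Keywords are added one request at a time.
    for keyword in keywords or []:
        steps.append(f"POST /datasets/{dsid}/keywords [{keyword}]")
    # Each file gets its own upload and its own ingestion request.
    for path in files_to_upload or []:
        steps.append(f"upload {path}")
        steps.append(f"request ingestion for {path}")
    return steps


for step in plan_create_calls(
    "ds-abc123",
    scientific_metadata={"temperature_K": 300},
    keywords=["XRD", "powder"],
    files_to_upload=["xrd_run5.xy"],
):
    print(step)
```

The number of requests grows with the number of keywords and files, which is worth keeping in mind for large batches.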

Retrieving a dataset

ds = client.datasets.get("ds-abc123")
ds_with_metadata = client.datasets.get("ds-abc123", include_metadata=True, include_links=True)

Listing datasets

# All datasets in a project
datasets = client.datasets.list(project_id="my-project", limit=50)

# Filter by measurement type
datasets = client.datasets.list(project_id="my-project", measurement="SEM imaging")

# Filter by keyword
datasets = client.datasets.list(project_id="my-project", keywords=["gold"])

# Datasets linked to a specific sample
datasets = client.datasets.list(sample_id="sm-xyz789")

Updating a dataset

client.datasets.update(
    "ds-abc123",
    dataset_name="XRD run 5 (corrected)",
    measurement="Powder X-ray diffraction",
)

Adding files to a dataset

Add files to an existing dataset:

client.datasets.add_file_to_dataset("ds-abc123", "additional_file.dat")

How the data ingestion process works

When a file is added to a dataset, three things happen:

The file is uploaded to cloud storage
A file record is created in the database linked to the dataset
An ingestion request is sent to the backend

Data type-specific ingestion classes parse scientific metadata, structured metadata, and thumbnails from the file. If an ingestion class is specified via the ingestor parameter, that class is used. Otherwise, the available classes are scanned from most to least specific. If no ingestion class supports the data type, metadata and thumbnails are not extracted from the file.
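
The selection rule can be sketched as follows. Every name here is hypothetical (the two ingestor classes, the `supports()` method, the `INGESTORS` list, and `select_ingestor`); the sketch only illustrates "explicit choice first, then scan from most to least specific" and does not reflect the actual code in the ingestion backend.

```python
from typing import Optional


class ScopeFoundryH5Ingestor:
    """Hypothetical ingestor for one specific data type."""

    @staticmethod
    def supports(data_type: str, data_format: str) -> bool:
        return data_type == "ScopeFoundry H5 file"


class GenericH5Ingestor:
    """Hypothetical fallback for any h5 file."""

    @staticmethod
    def supports(data_type: str, data_format: str) -> bool:
        return data_format == "h5"


# Ordered from most to least specific.
INGESTORS = [ScopeFoundryH5Ingestor, GenericH5Ingestor]


def select_ingestor(data_type: str, data_format: str, ingestor: Optional[str] = None):
    """Return the ingestion class to use, or None if nothing supports the file."""
    if ingestor is not None:
        # An explicitly requested class takes precedence over scanning.
        by_name = {cls.__name__: cls for cls in INGESTORS}
        return by_name[ingestor]
    for cls in INGESTORS:
        if cls.supports(data_type, data_format):
            return cls
    return None  # nothing matched: no metadata or thumbnails are extracted


print(select_ingestor("ScopeFoundry H5 file", "h5").__name__)  # ScopeFoundryH5Ingestor
print(select_ingestor("Unknown layout", "h5").__name__)        # GenericH5Ingestor
print(select_ingestor("Unknown layout", "dat"))                # None
```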

Ingestors do not overwrite dataset attributes provided at dataset creation. Structured metadata for the primary dataset record (e.g. timestamp, dataset_name, data_type) can be updated using the client.datasets.update() method.

For updates to the scientific metadata, the ingestion process uses client.datasets.update_scientific_metadata(overwrite=False). As a result, key-value pairs newly parsed during ingestion are added to the existing scientific_metadata, and newly parsed values replace the stored values for keys that already exist. To replace the entire scientific_metadata dictionary, do so manually with update_scientific_metadata(overwrite=True).
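
The difference between the two modes can be illustrated with plain dictionaries. This is a sketch under assumptions, not the server's logic: `merge_scientific_metadata` is a hypothetical helper, the values are made up, and the merge is shallow because the text does not say whether nested objects are merged recursively.

```python
def merge_scientific_metadata(existing, parsed, overwrite=False):
    """Combine stored metadata with newly parsed metadata (shallow merge assumed)."""
    if overwrite:
        # Replace mode: the stored dictionary is discarded.
        return dict(parsed)
    # Merge mode: keep existing keys, add new ones, update clashing ones.
    merged = dict(existing)
    merged.update(parsed)
    return merged


existing = {"temperature_K": 300, "sample_holder": "quartz"}
parsed = {"temperature_K": 295, "wavelength_angstrom": 0.7749}

print(merge_scientific_metadata(existing, parsed))
# {'temperature_K': 295, 'sample_holder': 'quartz', 'wavelength_angstrom': 0.7749}
print(merge_scientific_metadata(existing, parsed, overwrite=True))
# {'temperature_K': 295, 'wavelength_angstrom': 0.7749}
```

In merge mode, the manually entered `sample_holder` key survives ingestion, while `temperature_K` takes the newly parsed value.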

Files are deduplicated by SHA-256 hash. If you add the same file twice, it is not re-uploaded, but ingestion is requested again; this operation is idempotent.
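
To see what hash-based deduplication means in practice, the sketch below computes a file's SHA-256 digest locally and checks it against a set of digests that are already stored. The helper names and the `known_hashes` set are hypothetical; how Crucible performs or exposes this check is not described here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path, known_hashes):
    """True if no stored file has the same contents (hypothetical check)."""
    return sha256_of_file(path) not in known_hashes
```

Because the digest depends only on file contents, renaming a file does not change the result of this check.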

Warning

If two files with the same name but different contents are added to the same dataset, the upload proceeds but replaces the original file in cloud storage. A new file record is created with a new mfid and hash; the old record remains but its download link points to the new file. We are actively working on updated logic to address this.
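
One defensive option is to check for name clashes before uploading. The sketch below assumes you already hold a mapping of file name to SHA-256 digest for the files in the dataset; how to obtain that mapping from Crucible is not covered here, and `find_name_collisions` is a hypothetical helper rather than part of the client.

```python
def find_name_collisions(existing, candidates):
    """Return names present in both mappings whose digests differ.

    Both arguments map file name -> SHA-256 hex digest.
    """
    return sorted(
        name
        for name, digest in candidates.items()
        if name in existing and existing[name] != digest
    )


existing = {"scan.dat": "aaa111", "notes.txt": "bbb222"}
candidates = {"scan.dat": "ccc333", "notes.txt": "bbb222", "new.dat": "ddd444"}

print(find_name_collisions(existing, candidates))  # ['scan.dat']
```

Files reported by this check could be renamed before upload so that the original remains retrievable.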

If no ingestion class exists for your data type, reach out on Discord or contribute to the crucible-ingestion repository.

Scientific metadata

Scientific metadata stores experiment-specific parameters as a free-form JSON object.

# Merge new keys into existing metadata (PATCH — appends/updates individual keys)
client.datasets.update_scientific_metadata(
    "ds-abc123",
    metadata={"temperature_K": 300, "pressure_bar": 1.0, "scan_rate_mV_s": 50},
)

# Retrieve it
meta = client.datasets.get_scientific_metadata("ds-abc123")

# Search across all datasets
results = client.datasets.search_scientific_metadata("temperature", limit=20)

update_scientific_metadata() merges new keys into the existing metadata (PATCH). To replace all existing metadata entirely, use replace_scientific_metadata():

# Replace all metadata entirely (POST)
client.datasets.replace_scientific_metadata("ds-abc123", {"new_key": "value"})

Keywords

client.datasets.add_keyword("ds-abc123", "annealed")
keywords = client.datasets.get_keywords(dataset_id="ds-abc123")

Thumbnails

client.datasets.add_thumbnail("ds-abc123", "preview.png")
thumbnails = client.datasets.get_thumbnails("ds-abc123")

Downloading

# Download all files for a dataset
client.datasets.download("ds-abc123", output_dir="./downloads")

# Download only matching files
client.datasets.download("ds-abc123", output_dir="./downloads", include="*.dat")

# Get pre-signed download URLs (valid ~1 hour)
links = client.datasets.get_download_links("ds-abc123")
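
The `include="*.dat"` argument looks like a shell-style glob. Assuming that is how it is interpreted (the matching rules are not stated here, and the client may not use `fnmatch` internally), the following sketch shows which names such a pattern would select from a made-up file list.

```python
from fnmatch import fnmatch

# Hypothetical file names attached to a dataset.
file_names = ["scan_001.dat", "scan_002.dat", "preview.png", "notes.txt"]

# Keep only the names that match the glob pattern.
selected = [name for name in file_names if fnmatch(name, "*.dat")]

print(selected)  # ['scan_001.dat', 'scan_002.dat']
```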

Parent-child relationships between datasets

Link datasets to represent a processing pipeline:

# raw → processed
client.datasets.link_parent_child(parent_id=raw_dsid, child_id=processed_dsid)

# List relationships
parents = client.datasets.list_parents("ds-processed")
children = client.datasets.list_children("ds-raw")
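
Because the links form a directed hierarchy, the full upstream lineage of a dataset can be found by following parent links repeatedly. The sketch below uses a plain dictionary, `parents_of`, as a stand-in for repeated list_parents() calls, since the return shape of that method is not documented here. The `ancestors` helper and the dataset IDs are hypothetical.

```python
def ancestors(dsid, parents_of):
    """Return all upstream dataset IDs, nearest first, by following parent links."""
    seen = []
    queue = list(parents_of.get(dsid, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(parents_of.get(current, []))
    return seen


# raw -> calibrated -> analyzed, stored as child -> list of parents.
parents_of = {
    "ds-analyzed": ["ds-calibrated"],
    "ds-calibrated": ["ds-raw"],
}

print(ancestors("ds-analyzed", parents_of))  # ['ds-calibrated', 'ds-raw']
```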

Deleting a dataset

client.datasets.delete("ds-abc123")

Note

Calling delete() submits a deletion request — it does not immediately remove the resource. An admin must approve the request before the dataset is deleted.