Will data get duplicated upon re-running code?
LaminDB’s operations are idempotent in the sense described on this page, which lets you re-run code without duplicating data.
Records with name field
When you instantiate Record with a name and that name has an exact match in the registry, the constructor returns the existing record instead of creating a new one. If records with similar names exist, you’ll see them in a table and can then decide whether to save the new record or pick an existing one.
If you set search_names to False, you bypass these checks.
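To make the lookup rules above concrete, here is a minimal in-memory sketch. It is not LaminDB's implementation: `ToyRegistry`, its methods, and the use of `difflib` with a 0.8 cutoff for "similar names" are all assumptions chosen for illustration.

```python
import difflib


class ToyRegistry:
    """In-memory stand-in for a registry of named records (illustrative only)."""

    def __init__(self, search_names=True):
        self.search_names = search_names
        self.records = []  # saved records, each a dict with a "name" key

    def create(self, name):
        """Return (record, similar_names); nothing is saved at this point."""
        if self.search_names:
            for record in self.records:
                if record["name"] == name:
                    # Exact match: hand back the existing record.
                    return record, []
            # No exact match: surface close names so the caller can decide.
            names = [record["name"] for record in self.records]
            similar = difflib.get_close_matches(name, names, n=5, cutoff=0.8)
            return {"name": name}, similar
        # Search switched off: no checks, duplicates become possible.
        return {"name": name}, []

    def save(self, record):
        """Saving an already stored record is a no-op."""
        if not any(stored is record for stored in self.records):
            self.records.append(record)
        return record


registry = ToyRegistry()
first = registry.save(registry.create("My label 1")[0])

candidate, similar = registry.create("My label 1a")
print(similar)  # close matches found for the new name

again, _ = registry.create("My label 1")
registry.save(again)
print(again is first, len(registry.records))
```

In this toy model the exact-match branch is what makes re-running the creation code safe, while the similar-name list only informs the caller and never blocks a save.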
Artifacts & collections
If you instantiate Artifact from data that already exists as an artifact, the Artifact() constructor returns the existing artifact based on a hash lookup.
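The idea of a hash lookup can be sketched without LaminDB. The example below assumes a SHA-256 digest over the file bytes and a hypothetical `ToyArtifactStore`; which hash LaminDB actually computes is not stated here, and for brevity the sketch stores the artifact at creation time rather than in a separate save step.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path):
    """Hex digest of a file's bytes (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class ToyArtifactStore:
    """Maps content hashes to artifacts so identical data is stored once."""

    def __init__(self):
        self.by_hash = {}

    def create(self, path, key):
        digest = content_hash(path)
        existing = self.by_hash.get(digest)
        if existing is not None:
            # Same content was seen before: return the stored artifact.
            return existing
        artifact = {"key": key, "hash": digest}
        self.by_hash[digest] = artifact
        return artifact


with tempfile.TemporaryDirectory() as tmp:
    path_a = Path(tmp) / "a.fcs"
    path_b = Path(tmp) / "copy_of_a.fcs"
    path_a.write_bytes(b"same bytes")
    path_b.write_bytes(b"same bytes")

    store = ToyArtifactStore()
    artifact = store.create(path_a, key="my_fcs_file.fcs")
    artifact2 = store.create(path_b, key="my_fcs_file.fcs")

print(artifact2 is artifact)  # identical content maps to a single artifact
```

Because the lookup is keyed on content rather than on the file path, a copy of the same file under a different name resolves to the same artifact in this sketch.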
# pip install 'lamindb[jupyter]'
!lamin init --storage ./test-idempotency
→ initialized lamindb: testuser1/test-idempotency
import lamindb as ln
ln.track("ANW20Fr4eZgM0000")
→ connected lamindb: testuser1/test-idempotency
→ created Transform('ANW20Fr4eZgM0000'), started new Run('yCZJvhsK...') at 2025-03-10 11:51:58 UTC
→ notebook imports: lamindb==1.2.0
Records with name field
assert ln.settings.creation.search_names
Let us add a first record to the ULabel registry:
label = ln.ULabel(name="My label 1").save()
If we create a new record, we’ll automatically get search results that hint at whether we are about to duplicate an entry:
label = ln.ULabel(name="My label 1a")
! record with similar name exists! did you mean to load it?
id | uid | name | is_type | description | reference | reference_type | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | IqLEVMTt | My label 1 | False | None | None | None | 1 | None | 1 | 2025-03-10 11:52:00.136000+00:00 | 1 | None | 1
Let’s save the 1a label, since we actually intend to create it.
label.save()
ULabel(uid='vubDzkz6', name='My label 1a', is_type=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-10 11:52:00 UTC)
In case we match an existing name directly, we’ll get the existing object:
label = ln.ULabel(name="My label 1")
→ returning existing ULabel record with same name: 'My label 1'
If we save it again, it will not create a new entry in the registry:
label.save()
ULabel(uid='IqLEVMTt', name='My label 1', is_type=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-10 11:52:00 UTC)
Now, if we create a third record, we’ll get two alternatives:
label = ln.ULabel(name="My label 1b")
! records with similar names exist! did you mean to load one of them?
id | uid | name | is_type | description | reference | reference_type | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | IqLEVMTt | My label 1 | False | None | None | None | 1 | None | 1 | 2025-03-10 11:52:00.136000+00:00 | 1 | None | 1
2 | vubDzkz6 | My label 1a | False | None | None | None | 1 | None | 1 | 2025-03-10 11:52:00.199000+00:00 | 1 | None | 1
If we prefer not to perform a search, e.g. for performance reasons, we can switch it off:
ln.settings.creation.search_names = False
label = ln.ULabel(name="My label 1c")
Switch it back on:
ln.settings.creation.search_names = True
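Toggling a setting off and back on by hand is easy to get wrong if an exception occurs in between. One possible pattern is a small context manager that restores the previous value on exit. The helper `temporary_setting` below is hypothetical and not part of LaminDB; the sketch uses a `SimpleNamespace` as a stand-in for `ln.settings.creation` so that it runs on its own.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set settings.<name> to value inside the block, then restore it."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Runs even if the block raises, so the setting is never left off.
        setattr(settings, name, previous)


# Stand-in for ln.settings.creation so the sketch runs without LaminDB.
creation = SimpleNamespace(search_names=True)

with temporary_setting(creation, "search_names", False):
    inside = creation.search_names  # value seen while the block runs

print(inside, creation.search_names)
```

Assuming the real settings object allows plain attribute assignment as shown in the cells above, the same helper could be called with `ln.settings.creation` and `"search_names"`.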
Artifacts & collections
filepath = ln.core.datasets.file_fcs()
Create an Artifact:
artifact = ln.Artifact(filepath, key="my_fcs_file.fcs").save()
assert artifact.hash == "rCPvmZB19xs4zHZ7p_-Wrg"
assert artifact.run == ln.context.run
assert not artifact._subsequent_runs.exists()
Create an Artifact from the same path:
artifact2 = ln.Artifact(filepath, key="my_fcs_file.fcs")
→ returning existing artifact with same hash: Artifact(uid='VzQv0UhRROfIvx3f0000', is_latest=True, key='my_fcs_file.fcs', suffix='.fcs', size=19330507, hash='rCPvmZB19xs4zHZ7p_-Wrg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-03-10 11:52:00 UTC); to track this artifact as an input, use: ln.Artifact.get()
It gives us the existing object:
assert artifact.id == artifact2.id
assert artifact.run == artifact2.run
assert not artifact._subsequent_runs.exists()
If you save it again, nothing will happen (the operation is idempotent):
artifact2.save()
Artifact(uid='VzQv0UhRROfIvx3f0000', is_latest=True, key='my_fcs_file.fcs', suffix='.fcs', size=19330507, hash='rCPvmZB19xs4zHZ7p_-Wrg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-03-10 11:52:00 UTC)
The cell below shows how this interplays with data lineage: re-creating the artifact in a new run leaves its original run unchanged and records the new run as a subsequent run.
ln.context.track(new_run=True)
artifact3 = ln.Artifact(filepath, key="my_fcs_file.fcs")
assert artifact3.id == artifact2.id
assert artifact3.run == artifact2.run != ln.context.run # run is not updated
assert artifact2._subsequent_runs.first() == ln.context.run
→ loaded Transform('ANW20Fr4eZgM0000'), started new Run('Ms8EATZ7...') at 2025-03-10 11:52:00 UTC
→ notebook imports: lamindb==1.2.0
→ returning existing artifact with same hash: Artifact(uid='VzQv0UhRROfIvx3f0000', is_latest=True, key='my_fcs_file.fcs', suffix='.fcs', size=19330507, hash='rCPvmZB19xs4zHZ7p_-Wrg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-03-10 11:52:00 UTC); to track this artifact as an input, use: ln.Artifact.get()
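The behavior asserted in the cell above can be modeled in a few lines. This is a toy model under stated assumptions, not LaminDB's code: `ToyArtifact`, `get_or_create`, and the string run identifiers are hypothetical, and `subsequent_runs` merely mimics what the `_subsequent_runs` assertions check.

```python
class ToyArtifact:
    """Artifact that remembers its creating run and later re-creating runs."""

    def __init__(self, content_hash, run):
        self.hash = content_hash
        self.run = run  # run that first created the artifact
        self.subsequent_runs = []  # later runs that produced the same data


def get_or_create(store, content_hash, current_run):
    """Return the artifact for content_hash, creating it only if missing."""
    existing = store.get(content_hash)
    if existing is None:
        store[content_hash] = ToyArtifact(content_hash, current_run)
        return store[content_hash]
    # The creating run is never overwritten; a different run is appended once.
    if current_run != existing.run and current_run not in existing.subsequent_runs:
        existing.subsequent_runs.append(current_run)
    return existing


store = {}
a1 = get_or_create(store, "abc", current_run="run-1")
a2 = get_or_create(store, "abc", current_run="run-1")  # same run: no change
a3 = get_or_create(store, "abc", current_run="run-2")  # new run: recorded

print(a3 is a1, a3.run, a3.subsequent_runs)
```

Keeping the first run fixed preserves where the data originally came from, while the list of later runs shows which re-executions reproduced it.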
!rm -rf ./test-idempotency
!lamin delete --force test-idempotency
• deleting instance testuser1/test-idempotency