artifact
A representation of a file and its metadata in the SI database, specifically the inventory.Artifact table (see the SI data model). The term artifact is often used loosely to refer to both the database record and the file itself as one thing, but when an SI application acts on an artifact, it is acting specifically on the database record.
bucket
SI assigns a bucket label to files on storage and to artifacts in the inventory database. The storage bucket (e.g. a74) and the uri bucket (e.g. 04c) are independent labels: files in storage bucket a74 are not necessarily the same as artifacts in uri bucket 04c.
namespace
A namespace is an SI identifier for a collection of artifacts, and can be used to define the logical structure of data within SI. For example, cadc:CFHT/
might be used to identify all CFHT files held at the CADC; cadc:CFHT/raw/
might be used to identify all of the raw CFHT files held at the CADC -- both cadc:CFHT/
and cadc:CFHT/raw/
are namespaces but they identify different scopes of CFHT artifacts. SI services and applications often act on namespaces defined using regex patterns -- such as during file replication and determining access permissions -- so some thought must be put into what is used. See the SI data model page for more detail on the concepts of URI and namespace.
A namespace must end with : or /. For example, for the artifact URI cadc:TEST/data/raw/example/test1.fits, all of the following are valid namespaces:
cadc:
cadc:TEST/
cadc:TEST/data/
cadc:TEST/data/raw/
cadc:TEST/data/raw/example/
Putting files directly under the scheme (e.g. cadc:test1.fits) is not recommended.
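Since namespace matching is, at its simplest, a prefix match on the artifact URI, a minimal shell sketch (using only the example URIs and namespaces above) illustrates the scoping:

```shell
# Test whether an artifact URI falls within a namespace.
# A namespace is a simple prefix ending in ':' or '/'.
in_namespace() {
  uri="$1"; ns="$2"
  case "$uri" in
    "$ns"*) return 0 ;;
    *)      return 1 ;;
  esac
}

in_namespace "cadc:TEST/data/raw/example/test1.fits" "cadc:TEST/data/" && echo "in namespace"
in_namespace "cadc:TEST/data/raw/example/test1.fits" "cadc:OTHER/"     || echo "not in namespace"
```

Note that, as described above, SI services often define namespaces with regex patterns; this sketch only shows the prefix idea.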
The part of the URI before the : is referred to as the 'scheme' elsewhere in the documentation. This part of the namespace usually refers to an organization (e.g. cadc:DAO, mast:JWST).
resourceID
A unique ID for a deployed service. A registry
service provides a look-up to translate these IDs into service URLs. Example: ivo://opencadc.org/minoc
, which might resolve to https://www.opencadc.org/minoc
.
resourceID
is a label which identifies a service in an abstract way.
resourceIDs
may be shared within a site -- used by local services and applications -- but may also be shared across an organization. Because the scope of a site's resourceIDs could be broad, some thought needs to be put into their definition.
The ivo:
scheme in the resourceID
means that the registry
service that will be used to resolve the resourceID
complies with the IVOA registry standard. The available CADC reg
service is an example of an IVOA registry implementation.
Container images
Images are identified as https://images.opencadc.org/<project>/<image>:<tag>, where the list of projects, images, and tags can be found using the commands below. Images specific to Storage Inventory will be found in the storage-inventory project, although images for supporting services may be in other projects (e.g. in core). To list projects, images, and tags (jq is a helpful command for parsing JSON):
curl -s https://images.opencadc.org/api/v2.0/projects | jq '[.[].name] | sort'
curl -s https://images.opencadc.org/api/v2.0/projects/storage-inventory/repositories | jq '[.[].name] | sort'
curl -s https://images.opencadc.org/api/v2.0/projects/storage-inventory/repositories/minoc/artifacts | jq '[.[].tags | select (. != null) | .[].name] | sort'
Image tags are either a version (e.g. 0.9.0) or a version-datetimestamp (e.g. 0.9.0-20230217T201656). For a version x.y.z:
x will change with major releases -- functionality, API, and configuration may change and break earlier configuration.
y will change with minor releases -- minor non-breaking or backwards-compatible feature or configuration changes.
z will change with minor bug fixes -- otherwise compatible with current version features and configuration.
Storage Inventory consists of the components that make up one or more Storage sites and a Global site. A Storage site can exist on its own, as a mechanism for maintaining a structured inventory of files. A Global site is required when there are two or more Storage sites which need to be synchronized; it also provides a single site where users can find all available copies of a file. A detailed description of the data model, features, and limitations can be found here.
In general:
A Storage site maintains an inventory of the files stored at a particular location, and provides mechanisms to access (minoc) those files and query (luskan) the local inventory. Below is an outline of a stand-alone (no Global site) Storage Inventory Storage site, with one storage system, one database, etc, in one data centre. If you have files in multiple data centres, or more than one storage platform in one data centre (e.g. some files on a posix file-system and some on Ceph object storage), you would have more than one Storage site, and each site would run its own services, database, storage, and applications.
A standalone Storage Inventory Storage site will consist of the following:
File service (minoc): provides a REST-based file service that supports HEAD, GET, PUT, POST, and DELETE operations.
Query service (luskan): provides a web service for querying artifact metadata contained in the inventory database, using the IVOA Table Access Protocol (TODO: Replace TAP link with user document, not reference to spec).
File validation (tantar): artifact validation application that compares the inventory database with the contents of the back-end storage.
File removal (ringhold): removes the local copy of artifacts. Use with caution: essentially the same as \rm -r on a namespace at a storage site.
Registry service: used to map resourceIDs to the actual URLs where the services are deployed. Client software, services, and applications will use a registry to look up the locations of services. The linked cadc-registry-server is provided as an example implementation.
Permission service (baldur): uses configurable rules to grant access based on resource identifiers (Artifact.uri values or namespaces). This service is required if authentication and authorization (A&A) is needed for the SI deployment. Generally, baldur works together with a Group Membership Service (GMS) and/or user service.
Group Membership Service (GMS): provides the IVOA group membership look-up API used by baldur and other services when determining access permissions. For an example implementation, built on top of Indigo IAM, see https://gitlab.com/ska-telescope/src/group-membership-service (more implementation details to follow).
If you need to replicate files among multiple Storage sites, you will need a Global site. The Global site maintains a view of all Storage sites, allowing individual Storage sites to discover files that they need to copy. It also provides a single site that users can query to find files, rather than having to know about and search individual Storage sites.
A Global site requires different services than a Storage site, and both Storage sites and Global sites will need to run additional applications to synchronize metadata and files.
Metadata sync (fenwick): a Storage site will only need to run one instance of fenwick, but a Global site will need to run an instance of fenwick for each Storage site it needs to track. See the Metadata synchronization description below.
Metadata validation (ratik): as with fenwick, a Storage site will only need to run one instance of ratik, but a Global site will run an instance of ratik for each Storage site it is tracking.
File sync (critwall): queries the raven service for the locations of new files, and downloads them to the local storage. See the File synchronization description below.
As an example of file synchronization between two Storage sites (site1 and site2) through a Global site:
A new file is uploaded to the site1.minoc service, either directly or via negotiation with a global raven service.
global.fenwick.site1 discovers the new inventory metadata for the file by querying site1.luskan.
site2.fenwick.global discovers the new inventory metadata for the file by querying global.luskan.
site2.critwall finds the locations of the new file via global.raven -- this returns a list of URLs from which the file can be downloaded.
site2.critwall downloads the file from site1.minoc.
Generic HTTP client tools such as curl
or wget
can be used to interact with the SI; however, multi-step operations such as transfer negotiation or transfer of large files with
SI transactions might require dedicated scripts. (TODO provide examples of usage)
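Pending those examples, the following sketches constructing a file-endpoint URL for an artifact; the /files path and the certificate location are assumptions to verify against your deployment's API documentation:

```shell
# Build the raven file-endpoint URL for an artifact (host and URI are examples).
base="https://www.example.org/raven"
uri="cadc:TEST/file.fits"
url="$base/files/$uri"
echo "$url"
# Fetch it, following raven's redirect to a storage-site minoc:
#   curl -L -O -E ~/.ssl/cadcproxy.pem "$url"
```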
Alternatively, the CADC maintains Python client applications/libraries that can be used with the SI:
cadcdata
- for file operations with minoc and raven: transfer negotiation, file upload, download, delete, or simply retrieving file information. The package takes advantage of SI features to offer robust and fault-tolerant transfer of files, small or large.
cadctap
- for querying artifact metadata. It works with a luskan service deployed at a site or with the global one. Alternatively, any generic TAP-based tool can be used to query luskan, including (but not limited to) PyVO.
The CADC Direct Data Service
presents a variety of scenarios for accessing the CADC SI using generic and specific client tools.
Database:
Storage platform:
Required for: Storage site only
Ceph Object store (version 14 or greater)
OR
POSIX file-system.
Worker nodes:
haveged (or other entropy-generating service): this is only necessary on hosts running the services.
Note on logging: Storage Inventory services and application containers all log to stdout by default -- for a production deployment, these logs should be captured and preserved by whatever mechanism is available on your system.
For POSIX storage, the storage file-system will need to be mounted directly into the containers (e.g. a 'volume' path in Docker or a PVC in Kubernetes). Since the storage will be mounted by several containers, it will need to be a shared file-system which supports writes from multiple hosts.
❗NOTE: the services in the containers run as a user with a UID:GID of 8675309:8675309. This user must be allowed to read and write files on the configured file-system. This is usually done by ensuring a non-privileged (or even 'nologin') user is configured on your system with this UID:GID.
In the cadc-storage-adapter-fs.properties configuration file for POSIX storage:
The org.opencadc.inventory.storage.fs.baseDir parameter must point to the location where the storage is mounted inside the container. For example,
docker run --user tomcat:tomcat -v /path/on/host:/mountpoint/in/container minoc:0.9.2
or
apiVersion: apps/v1
kind: Deployment
<...snip....>
        volumeMounts:
        - mountPath: "/mountpoint/in/container"
          name: lustre-volume
      securityContext:
        runAsUser: 8675309
        runAsGroup: 8675309
      volumes:
      - name: lustre-volume
        hostPath:
          path: /path/on/host
          type: Directory
org.opencadc.inventory.storage.fs.OpaqueFileSystemStorageAdapter.bucketLength
sets the depth of the directory tree created to store files. At each node in the tree, 16 hex (0-f) directories are created -- a bucketLength
of 2 will create 16 directories (0-f) each with sixteen subdirectories (0-f) -- only the 256 (16x16) subdirectories at the bottom of the tree will be used to store files. For efficiency of validation, you should choose a bucketLength
which results in only a few thousand files in each directory. For example, for bucketLength=3 and baseDir = /mount/in/container:
[container]$ ls -F /mount/in/container/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=1/3
[container]$ ls -F /mount/in/container/a/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=2/3
[container]$ ls -F /mount/in/container/a/7/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=3/3
[container]$ ls -F /mount/in/container/a/7/4/
test0001.fits test0002.fits test0003.fits test0004.fits test0005.fits
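When choosing a bucketLength, a quick back-of-the-envelope calculation of expected files per bucket can help; the total file count below is only an assumed example:

```shell
# Estimate files per bucket for candidate bucketLength values:
# each additional hex character multiplies the bucket count by 16.
total_files=10000000
for len in 1 2 3 4 5; do
  buckets=$((16 ** len))
  echo "bucketLength=$len -> $((total_files / buckets)) files/bucket"
done
```

With 10 million files, bucketLength=3 gives roughly 2400 files per bucket, in line with the "few thousand" guideline.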
In the directory listing above, files under /mount/in/container/a/7/4/ belong to bucket a74, and there will be a total of 4096 (16x16x16) buckets.
In the cadc-storage-adapter-swift.properties
configuration file for Swift storage:
org.opencadc.inventory.storage.swift.SwiftStorageAdapter.bucketLength
sets the number of hex characters in the configured buckets (e.g. a74
), and the total number of buckets (e.g. a bucketLength
of 3 will create 16^3 (4096) buckets). Configure the bucketLength
so the expected number of files per bucket is no more than a few thousand.
In the following, the database being created is called si_db, but you can change that name as you see fit. Whatever you choose, it will need to be referenced in the service and application configuration.
initdb -D /var/lib/postgresql/data --encoding=UTF8 --lc-collate=C --lc-ctype=C
You may wish to use a different data directory (-D), depending on your postgres installation and hardware layout.
As the postgres user, create a file named si.ddl with the linked content, edit as appropriate, and run psql -f si.ddl -a
Three database accounts are used:
TAP admin user (e.g. tapadm) - privileged user. Manages the tap schema with permissions to create, alter, and drop tables.
TAP query user (e.g. tapuser) - unprivileged user. Used by the luskan service to query the inventory database.
Inventory admin user (e.g. invadm) - privileged user. Manages the inventory schema with privileges to create, alter, and drop tables, and is also used to insert, update, and delete rows in the inventory tables.
Proxy certificates require openssl 1.0.2k or a compatible version. Newer versions of openssl do not support proxy certificates; OPENSSL_ALLOW_PROXY_CERTS=1 needs to be set in the proxy environment.
Registry service
Use the core/reg image from images.opencadc.org, or a different IVOA-compatible registry service. The registry maps resourceIDs to service URLs: for example, if your minoc service is available at the URL https://www.example.org/minoc and you choose a resourceID of ivo://example.org/minoc, the registry config for that resource (in the reg-resource-caps.properties file for the registry service) would look like:
ivo://example.org/minoc = https://www.example.org/minoc
This resourceID will appear in, for example, the minoc.properties file in the minoc service config:
org.opencadc.minoc.resourceID = ivo://example.org/minoc
To check the registry deployment:
curl https://www.example.org/reg/resource-caps
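The resource-caps output is a list of 'resourceID = URL' lines, so a specific entry can be extracted with standard text tools. The sample content below mirrors the example registry entry above; in practice you would pipe the curl output instead:

```shell
# Look up the URL for a resourceID in resource-caps style content.
caps='ivo://example.org/minoc = https://www.example.org/minoc'
lookup() { printf '%s\n' "$caps" | grep -F "$1 = " | cut -d' ' -f3; }
lookup "ivo://example.org/minoc"
```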
baldur - Permission service
Use the storage-inventory/baldur image from images.opencadc.org.
In baldur.properties:
The x509 DN specified in org.opencadc.baldur.allowedUser is generally a 'service' user -- the services that call baldur need to be configured with this user's certificate. In the minoc and raven configuration, this is the cadcproxy.pem file for these services.
org.opencadc.baldur.allowedGroup is an IVOA GMS group resourceID. The readOnlyGroup and readWriteGroup entries are also IVOA GMS group resourceIDs.
To check the deployment:
curl https://www.example.org/baldur/availability
GMS - Group Membership service
minoc - File service
Use the storage-inventory/minoc image from images.opencadc.org.
In minoc.properties:
org.opencadc.minoc.resourceID: the resourceID of this minoc service.
Permissions are granted via org.opencadc.minoc.readGrantProvider and org.opencadc.minoc.writeGrantProvider. It is possible to have multiple instances of these providers, by specifying the GrantProvider options for each provider (each use of the option is additive to the previous ones).
org.opencadc.minoc.publicKeyFile: the public key generated for raven and configured as org.opencadc.raven.publicKeyFile.
In catalina.properties (from cadc-tomcat config):
The org.opencadc.minoc.inventory.username database account is the 'Inventory admin user' configured when creating the database.
To check the deployment:
curl https://www.example.org/minoc/availability
luskan - Query service
Use the storage-inventory/luskan image from images.opencadc.org.
In luskan.properties:
org.opencadc.luskan.isStorageSite - for a storage site, this should be set to true. The content of the inventory database is different between a storage site and a global site.
org.opencadc.luskan.allowedGroup is an IVOA GMS group resourceID.
In catalina.properties:
The org.opencadc.luskan.uws.username database account is generally the same as the 'TAP admin user' configured when creating the database.
The org.opencadc.luskan.tapadm.username database account is the same 'TAP admin user'.
The org.opencadc.luskan.query.username database account is the 'TAP query user' account.
In cadc-tap-tmp.properties:
org.opencadc.tap.tmp.TempStorageManager.baseURL is the URL for this luskan service, plus a path where query results can be retrieved from. For example, if this luskan service is deployed at https://www.example.org/luskan, then this baseURL could be https://www.example.org/luskan/results.
The /results path will be mapped to the path in the container specified by org.opencadc.tap.tmp.TempStorageManager.baseStorageDir. Ideally, this path will be a file-system that is shared among all luskan instances for your site. With baseStorageDir = /tmpdata in your configuration, luskan will store query results there (e.g. /tmpdata/xyz.xml) and that result will be retrievable as https://www.example.org/luskan/results/xyz.xml.
To check the deployment:
curl https://www.example.org/luskan/availability
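Since luskan implements the IVOA TAP standard, a generic HTTP client can query it through the standard /sync endpoint. The host below is the example deployment above, and the table and column names follow the SI data model:

```shell
# Build a TAP sync query against luskan (run the commented curl to execute it).
base="https://www.example.org/luskan"
query="SELECT TOP 5 uri, contentLength FROM inventory.Artifact"
echo "$base/sync"
# curl -s "$base/sync" --data-urlencode "LANG=ADQL" --data-urlencode "QUERY=$query"
```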
raven - File location service
Use the storage-inventory/raven image from images.opencadc.org.
In raven.properties:
org.opencadc.raven.publicKeyFile and org.opencadc.raven.privateKeyFile:
These are an optional optimization that lets raven generate 'pre-authorized' URLs for files, allowing the minoc instances that serve the files to skip the permission check before delivering them. The authentication information is embedded in a specially encoded URL.
These are RSA public and private key files, which can be generated using cadc-keygen or the commands below:
ssh-keygen -b 2048 -t rsa -m pkcs8 -f temp_rsa
ssh-keygen -e -m pkcs8 -f temp_rsa.pub > raven-public.key
mv temp_rsa raven-private.key
rm temp_rsa.pub
The publicKeyFile will be required by services which need to verify the pre-authorized URLs (minoc).
To check the deployment:
curl https://www.example.org/raven/availability
fenwick - Metadata sync application
Use the storage-inventory/fenwick image from images.opencadc.org.
In fenwick.properties:
org.opencadc.fenwick.queryService:
queryService is the resourceID for the remote luskan service -- i.e. if fenwick is running at a Storage site, queryService should refer to the remote Global site luskan; if fenwick is running at the Global site, queryService should refer to the remote Storage site luskan service. A Global site will need to run a fenwick instance for each Storage site.
tantar - File validation application
Use the storage-inventory/tantar image from images.opencadc.org.
In tantar.properties:
org.opencadc.tantar.buckets:
The range of buckets to validate (e.g. 0-f); for multiple instances of tantar, you would want to configure these to operate on non-overlapping subsets of buckets (e.g. 0-7, 8-f).
critwall - File sync application
Use the storage-inventory/critwall image from images.opencadc.org.
In critwall.properties:
org.opencadc.critwall.locatorService:
The resourceID of the global file location service, raven.
org.opencadc.critwall.buckets:
The range of buckets to operate on (e.g. 0-f); multiple instances should be configured with non-overlapping subsets of buckets.
ratik - Metadata validation
Use the storage-inventory/ratik image from images.opencadc.org.
In ratik.properties:
org.opencadc.ratik.queryService:
queryService is the resourceID for the remote luskan service -- i.e. if ratik is running at a Storage site, queryService should refer to the remote Global site luskan; if ratik is running at the Global site, queryService should refer to the remote Storage site luskan service. A Global site will need to run a ratik instance for each Storage site.
org.opencadc.ratik.buckets:
The range of buckets to operate on (e.g. 0-f).
Each service provides an /availability endpoint which can be used for both monitoring and healthchecks. /availability?detail=min returns no data, but the HTTP return code can be used for an efficient healthcheck.
Additional FAQ can be found here.
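A healthcheck script then only needs the HTTP status code; the decision logic can be sketched as below (the commented curl shows how the code would be obtained for the example minoc deployment):

```shell
# Treat HTTP 200 from /availability?detail=min as healthy.
healthy() { [ "$1" = "200" ]; }
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   "https://www.example.org/minoc/availability?detail=min")
healthy "200" && echo "healthy"
healthy "503" || echo "unhealthy"
```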
Note that configuration keys are specific to each service or application: the key for the database URL for the application fenwick is org.opencadc.fenwick.inventory.url; the key for the database URL for the service minoc is org.opencadc.minoc.inventory.url. It is easy to cut and paste between config files and forget to change the key.
To remove artifacts in a namespace from a site, use ratik -- this will remove the artifacts from the site while ensuring that at least one other copy of the artifact exists elsewhere in your Storage Inventory system.
Update the artifact-selector.sql for ratik and fenwick at the site you wish to remove the artifacts from, to exclude their namespace.
ratik verifies the local site inventory against the global inventory. If the artifacts are not in the site's 'selector' list, it will remove them from the local site inventory database.
fenwick syncs artifact metadata from the global inventory. If the artifacts are not in the site's 'selector' list, it will not create new local artifacts for them.
Run ratik at that site. This will only remove the artifacts from the inventory.Artifact table in the database.
Update org.opencadc.tantar.purgeNamespace for tantar at the site to include the namespace of the artifacts being removed.
Run tantar at the site. This will remove the files from storage.
Alternatively, using ringhold:
Update the artifact-selector.sql
for ratik
and fenwick
at the site you wish to remove the artifacts from to exclude their namespace.
This prevents ratik or fenwick from re-creating the artifacts.
Update the artifact-deselector.sql for ringhold at the site to include the artifacts' namespace.
ringhold will explicitly remove the artifacts identified by the 'deselector' WHERE clause from the inventory database.
Run ringhold.
Update org.opencadc.tantar.purgeNamespace for tantar at the site to include the namespace of the artifacts being removed.
Run tantar at the site. This will remove the files from storage.
The org.opencadc.tantar.policy.ResolutionPolicy
determines whether the storage is definitive ('StorageIsAlwaysRight') or the inventory database is definitive ('InventoryIsAlwaysRight'). Usually, you will be using 'InventoryIsAlwaysRight'. The reason field in the log line gives the rationale for each decision:
reason=no-matching-artifact - the file didn't match anything in the database, so should be deleted.
reason=old-storageLocation - the file didn't match the storage ID of the artifact, and the correct file with the matching storage ID was available.
war-rename.conf
By default, the SI services (baldur, luskan, minoc, raven) expect to be available via a URL like https://example.org/minoc -- i.e. the service name immediately follows the domain name, with a path of '/', and the name of the service in the URL is unchanged.
To change the service name in the URL, place a war-rename.conf file in the service's /config directory.
The war-rename.conf file contains a simple Unix shell-like 'mv' command to rename the service war file, e.g. mv minoc.war newname.war -- in this example, the minoc service would be available as https://example.org/newname.