artifact
A representation of a file and its metadata in the SI database, specifically the inventory.Artifact table (see the SI data model). The term artifact is often used to refer to both the database representation of a file and the file itself as one thing, but when an SI application is acting on an artifact, it is acting specifically on the database record.
bucket
SI assigns a bucket label to files on storage and to artifacts in the inventory database.
Storage buckets (e.g. a74) and Artifact uri buckets (e.g. 04c) are assigned independently: the files in storage bucket 04c are not necessarily the same as the artifacts in uri bucket 04c.
namespace
A namespace is an SI identifier for a collection of artifacts, and can be used to define the logical structure of data within SI. For example, cadc:CFHT/ might be used to identify all CFHT files held at the CADC; cadc:CFHT/raw/ might be used to identify all of the raw CFHT files held at the CADC -- both cadc:CFHT/ and cadc:CFHT/raw/ are namespaces but they identify different scopes of CFHT artifacts. SI services and applications often act on namespaces defined using regex patterns -- such as during file replication and determining access permissions -- so some thought must be put into what is used. See the SI data model page for more detail on the concepts of URI and namespace.
A namespace ends with : or /. For example, the valid namespaces for the artifact URI cadc:TEST/data/raw/example/test1.fits are:
cadc:
cadc:TEST/
cadc:TEST/data/
cadc:TEST/data/raw/
cadc:TEST/data/raw/example/
Using the scheme alone as a namespace (e.g. cadc: for an artifact URI like cadc:test1.fits) is not recommended.
The part of a URI before the : is referred to as the 'scheme' elsewhere in the documentation. This part of the namespace usually refers to an organization (e.g. cadc:DAO, mast:JWST).
resourceID
This is a unique ID for a deployed service. A registry service provides a look-up to translate these IDs into service URLs. Example: ivo://opencadc.org/minoc, which might resolve to https://www.opencadc.org/minoc.
resourceID is a label which identifies a service in an abstract way.
resourceIDs may be shared within a site -- used by local services and applications -- but may also be shared across an organization. Because the scope of a site's resourceIDs could be broad, some thought needs to be put into their definition. The ivo: scheme in a resourceID means that the registry service used to resolve it complies with the IVOA registry standard. The CADC reg service is an example of an IVOA registry implementation.
Container images are identified as https://images.opencadc.org/<project>/<image>:<tag>, where the list of projects, images, and tags can be found using the commands below. Images specific to Storage Inventory will be found in the storage-inventory project, although images for supporting services may be in other projects (e.g. in core). To list them (jq is a helpful command for parsing JSON):
curl -s https://images.opencadc.org/api/v2.0/projects | jq '[.[].name] | sort'
curl -s https://images.opencadc.org/api/v2.0/projects/storage-inventory/repositories | jq '[.[].name] | sort'
curl -s https://images.opencadc.org/api/v2.0/projects/storage-inventory/repositories/minoc/artifacts | jq '[.[].tags | select (. != null) | .[].name] | sort'
Image tags are either a version (e.g. 0.9.0) or a version-datetimestamp (e.g. 0.9.0-20230217T201656). For a version number x.y.z:
x will change with major releases -- functionality, API, and configuration may change and break earlier configuration.
y will change with minor releases -- minor non-breaking or backwards-compatible feature or configuration changes.
z will change with minor bug fixes -- otherwise compatible with current version features and configuration.
Storage Inventory consists of the components that make up one or more Storage sites and a Global site. A Storage site can exist on its own, as a mechanism for maintaining a structured inventory of files. A Global site is required when there are two or more Storage sites which need to be synchronized; it also provides a single place for users to find all available copies of a file. A detailed description of the data model, features, and limitations can be found here.
In general:
A Storage site maintains an inventory of the files stored at a particular location, and provides mechanisms to access (minoc) those files and query (luskan) the local inventory. Below is an outline of a stand-alone (no Global site) Storage Inventory Storage site, with one storage system, one database, etc., in one data centre. If you have files in multiple data centres, or more than one storage platform in one data centre (e.g. some files on a POSIX file-system and some on Ceph object storage), you would have more than one Storage site, and each site would run its own services, database, storage, and applications.
A standalone Storage Inventory Storage site will consist of the following:
File service (minoc): provides a REST-based file service that supports HEAD, GET, PUT, POST, DELETE operations.
Query service (luskan): provides a web service for querying artifact metadata contained in the inventory database, using the IVOA Table Access Protocol (TODO: Replace TAP link with user document, not reference to spec.)
File validation (tantar): artifact validation application that compares the inventory database with the contents of the back-end storage.
Local artifact removal (ringhold): removes the local copy of artifacts. Use with caution: essentially the same as \rm -r on a namespace at a storage site.
Registry service: used to map resourceIDs to the actual URLs where the service is deployed. Client software, services, and applications will use a registry to look up the locations of services. The linked cadc-registry-server is provided as an example implementation.
Permissions service (baldur): uses configurable rules to grant access based on resource identifiers (Artifact.uri values or namespaces).
This service is required if Authentication and Authorization (A&A) is required for the SI deployment. Generally, baldur works along with a Group Membership Service (GMS) and/or User Service.
Group Membership Service (GMS): needed for providing the IVOA group membership look-up API used by baldur and other services when determining access permissions. For an example implementation, built on top of INDIGO IAM, see https://gitlab.com/ska-telescope/src/group-membership-service (more implementation details to follow).
If you need to replicate files among multiple Storage sites, you will need a Global site. The Global site maintains a view of all Storage sites, allowing individual Storage sites to discover files that they need to copy. This also provides a single site which users can query to find files, rather than having to know about and search individual Storage sites.
A Global site requires different services than a Storage site, and both Storage sites and Global sites need to run additional applications to synchronize metadata and files.
Metadata sync (fenwick): a Storage site needs only one instance of fenwick, but a Global site will need to run an instance of fenwick for each Storage site it needs to track. See the Metadata synchronization description below.
Metadata validation (ratik): as with fenwick, a Storage site will only need to run one instance of ratik, but a Global site will run an instance of ratik for each Storage site it is tracking.
File sync (critwall): a Storage site runs critwall to query the Global raven service for the locations of missing files, and download them to the local storage. See the File synchronization description below.
As an example of how a file at one Storage site (site1) is replicated to another (site2) via a Global site:
1. A new file is put to the site1.minoc service, either directly or via negotiation with a global raven service.
2. global.fenwick.site1 discovers the new inventory metadata for the file by querying site1.luskan.
3. site2.fenwick.global discovers the new inventory metadata for the file by querying global.luskan.
4. site2.critwall finds the locations of the new file via global.raven -- this returns a list of URLs from which the file can be downloaded.
5. site2.critwall downloads the file from site1.minoc.
Generic HTTP client tools such as curl or wget can be used to interact with the SI; however, multi-step operations such as transfer negotiation or transfer of large files with SI transactions might require dedicated scripts. (TODO provide examples of usage)
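In the meantime, a minimal, hypothetical sketch of direct HTTP access (the service URL, certificate file, and artifact URI are example values from this page; the /files endpoint path should be checked against your minoc documentation):
# upload a file with HTTP PUT, authenticating with a client certificate
curl -E cadcproxy.pem -T test1.fits https://www.example.org/minoc/files/cadc:TEST/data/raw/example/test1.fits
# retrieve the same file
curl -E cadcproxy.pem -O https://www.example.org/minoc/files/cadc:TEST/data/raw/example/test1.fits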
Alternatively, the CADC maintains Python client applications/libraries that can be used with the SI:
cadcdata - for file operations with minoc and raven: transfer negotiation, file uploads, downloads, deletes, and file information. The package takes advantage of SI features to offer robust and fault-tolerant transfer of files, small or large.
cadctap - for querying artifact metadata. It works with a luskan service deployed at a site or the global one. Alternatively, any generic TAP-based tool can be used to query luskan, including (but not limited to) PyVO.
The CADC Direct Data Service presents a variety of scenarios for accessing the CADC SI using generic and specific client tools.
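For illustration, a hypothetical session with these clients (the luskan URL and artifact URI are example values; exact command options may differ between package versions):
pip install cadcdata cadctap
# file information and download via the SI file services
cadcinfo cadc:TEST/data/raw/example/test1.fits
cadcget cadc:TEST/data/raw/example/test1.fits
# query artifact metadata in a luskan TAP service
cadc-tap query -s https://www.example.org/luskan "SELECT uri, contentLength FROM inventory.Artifact WHERE uri LIKE 'cadc:TEST/%'"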
Database: PostgreSQL (see the database setup below).
Storage platform (required for: Storage site only):
Ceph Object store (version 14 or greater)
OR
POSIX file-system.
Worker nodes:
haveged (or other entropy-generating service): this is only necessary on hosts running the services.
Note on logging: Storage Inventory services and application containers all log to stdout by default -- for a production deployment, these logs should be captured and preserved by whatever mechanism is available on your system.
For POSIX storage, the storage file-system will need to be mounted directly into the containers (e.g. a 'volume' path in Docker or a PVC in Kubernetes). Since the storage will be mounted by several containers, it will need to be a shared file-system which supports writes from multiple hosts.
❗NOTE: the services in the containers run as a user with a UID:GID of 8675309:8675309. This user must be allowed to read and write files on the configured file-system. This is usually done by ensuring a non-privileged (or even 'nologin') user is configured on your system with this UID:GID.
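For example, one way to create such an account (the name tomcat matches the docker run example below; any name will do as long as the UID:GID is 8675309:8675309):
groupadd -g 8675309 tomcat
useradd -u 8675309 -g 8675309 -M -s /sbin/nologin tomcat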
In the cadc-storage-adapter-fs.properties configuration file for POSIX storage:
The org.opencadc.inventory.storage.fs.baseDir parameter must point to the location where the storage is mounted inside the container. For example:
docker run --user tomcat:tomcat -v /path/on/host:/mountpoint/in/container minoc:0.9.2
or
apiVersion: apps/v1
kind: Deployment
<...snip....>
volumeMounts:
- mountPath: "/mountpoint/in/container"
name: lustre-volume
securityContext:
runAsUser: 8675309
runAsGroup: 8675309
volumes:
- name: lustre-volume
hostPath:
path: /path/on/host
type: Directory
org.opencadc.inventory.storage.fs.OpaqueFileSystemStorageAdapter.bucketLength sets the depth of the directory tree created to store files. At each node in the tree, 16 hex (0-f) directories are created -- a bucketLength of 2 will create 16 directories (0-f), each with sixteen subdirectories (0-f) -- and only the 256 (16x16) subdirectories at the bottom of the tree will be used to store files. For efficiency of validation, you should choose a bucketLength which results in only a few thousand files in each directory. E.g., for bucketLength=3 and baseDir=/mount/in/container:
[container]$ ls -F /mount/in/container/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=1/3
[container]$ ls -F /mount/in/container/a/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=2/3
[container]$ ls -F /mount/in/container/a/7/
0/ 1/ 2/ 3/ 4/ 5/ 6/ 7/ 8/ 9/ a/ b/ c/ d/ e/ f/ # Depth=3/3
[container]$ ls -F /mount/in/container/a/7/4/
test0001.fits test0002.fits test0003.fits test0004.fits test0005.fits
In this example, the file test0001.fits is in storage bucket a74, and there will be a total of 4096 (16x16x16) buckets.
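Putting the two parameters together, a hypothetical cadc-storage-adapter-fs.properties fragment matching this example (only the keys discussed here; your adapter configuration may need more):
org.opencadc.inventory.storage.fs.baseDir = /mount/in/container
org.opencadc.inventory.storage.fs.OpaqueFileSystemStorageAdapter.bucketLength = 3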
In the cadc-storage-adapter-swift.properties configuration file for Swift storage:
org.opencadc.inventory.storage.swift.SwiftStorageAdapter.bucketLength sets the number of hex characters in the configured bucket names (e.g. a74), and hence the total number of buckets (i.e. a bucketLength of 3 will create 16^3 (4096) buckets). Configure the bucketLength so the expected number of files per bucket is no more than a few thousand.
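For example, a hypothetical fragment (only the key discussed here; the Swift connection settings are omitted):
org.opencadc.inventory.storage.swift.SwiftStorageAdapter.bucketLength = 3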
In the following, the database being created is called si_db, but you can change that name as you see fit. Whatever you choose, it will need to be referenced in the service and application configuration.
Initialize the database, e.g.: initdb -D /var/lib/postgresql/data --encoding=UTF8 --lc-collate=C --lc-ctype=C. You may want a different data directory (-D), depending on your postgres installation and hardware layout. As the postgres user, create a file named si.ddl with the linked content, edit as appropriate, and run psql -f si.ddl -a
TAP admin user (tapadm) - privileged user. Manages the tap schema with permissions to create, alter, and drop tables. Used by: luskan.
TAP query user (tapuser) - unprivileged user. Used by the luskan service to query the inventory database.
Inventory admin user (invadm) - privileged user. Manages the inventory schema with privileges to create, alter, and drop tables, and is also used to insert, update, and delete rows in the inventory tables. Used by: minoc and the fenwick, ratik, tantar, and critwall applications.
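As a minimal sketch of creating these accounts, if your si.ddl does not already create them (role names from this section; passwords are placeholders, and you may want finer-grained GRANTs):
sudo -u postgres psql -d si_db <<'EOF'
CREATE ROLE tapadm LOGIN PASSWORD 'change-me';   -- TAP admin user
CREATE ROLE tapuser LOGIN PASSWORD 'change-me';  -- TAP query user
CREATE ROLE invadm LOGIN PASSWORD 'change-me';   -- inventory admin user
EOF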
Proxy certificates require openssl 1.0.2k or a compatible version: newer versions of openssl do not support proxy certificates unless OPENSSL_ALLOW_PROXY_CERTS=1 is set in the proxy environment.
Registry service: use the core/reg image from images.opencadc.org, or a different IVOA-compatible registry service.
The registry maps resourceIDs to service URLs. If your minoc service is available at the URL https://www.example.org/minoc and you choose a resourceID of ivo://example.org/minoc, the registry config for that resource (in the reg-resource-caps.properties file for the registry service) would look like: ivo://example.org/minoc = https://www.example.org/minoc
This resourceID will appear in, for example, the minoc.properties file in the minoc service config: org.opencadc.minoc.resourceID = ivo://example.org/minoc
To test the registry service: curl https://www.example.org/reg/resource-caps
baldur - Permission service
Use the storage-inventory/baldur image from images.opencadc.org. In baldur.properties:
The x509 DN specified in org.opencadc.baldur.allowedUser is generally a 'service' user -- the services that call baldur need to be configured with this user's certificate.
In the minoc and raven configuration, this is the cadcproxy.pem file for these services.
org.opencadc.baldur.allowedGroup is an IVOA GMS group resourceID.
The readOnlyGroup and readWriteGroup entries are also IVOA GMS group resourceIDs.
To test: curl https://www.example.org/baldur/availability
GMS - Group Membership service
minoc - File service
Use the storage-inventory/minoc image from images.opencadc.org. In minoc.properties:
org.opencadc.minoc.resourceID:
org.opencadc.minoc.readGrantProvider and org.opencadc.minoc.writeGrantProvider: it is possible to have multiple instances of these providers, by specifying the GrantProvider option once for each provider (each use of the option is additive to the previous ones).
org.opencadc.minoc.publicKeyFile:
This is the public key generated for the raven service (see org.opencadc.raven.publicKeyFile below).
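Collecting the keys above into a hypothetical minoc.properties fragment (the resourceID values are examples for this deployment):
org.opencadc.minoc.resourceID = ivo://example.org/minoc
org.opencadc.minoc.readGrantProvider = ivo://example.org/baldur
org.opencadc.minoc.writeGrantProvider = ivo://example.org/baldur
org.opencadc.minoc.publicKeyFile = raven-public.key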
In catalina.properties (from cadc-tomcat config): the org.opencadc.minoc.inventory.username database account is the 'Inventory admin user' configured when creating the database.
To test: curl https://www.example.org/minoc/availability
luskan - Query service
Use the storage-inventory/luskan image from images.opencadc.org. In luskan.properties:
org.opencadc.luskan.isStorageSite - for a storage site, this should be set to true. The content of the inventory database is different between a storage site and a global site.
org.opencadc.luskan.allowedGroup is an IVOA GMS group resourceID.
catalina.properties:
The org.opencadc.luskan.uws.username database account is generally the same as the 'TAP admin user' configured when creating the database.
The org.opencadc.luskan.tapadm.username database account is the same 'TAP admin user'.
The org.opencadc.luskan.query.username database account is the 'TAP query user' account.
In cadc-tap-tmp.properties:
org.opencadc.tap.tmp.TempStorageManager.baseURL is the URL for this luskan service, plus a path where query results can be retrieved from.
If this luskan service is at https://www.example.org/luskan, then this baseURL could be https://www.example.org/luskan/results. The results path will be mapped to the path in the container specified by org.opencadc.tap.tmp.TempStorageManager.baseStorageDir. Ideally, this path will be a file-system that is shared among all luskan instances for your site.
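For example, a hypothetical cadc-tap-tmp.properties matching the URLs used here:
org.opencadc.tap.tmp.TempStorageManager.baseURL = https://www.example.org/luskan/results
org.opencadc.tap.tmp.TempStorageManager.baseStorageDir = /tmpdata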
E.g. with baseStorageDir = /tmpdata in your configuration, luskan will store query results there (e.g. /tmpdata/xyz.xml) and that result will be retrievable as https://www.example.org/luskan/results/xyz.xml.
To test: curl https://www.example.org/luskan/availability
raven - File location service
Use the storage-inventory/raven image from images.opencadc.org. In raven.properties:
org.opencadc.raven.publicKeyFile and org.opencadc.raven.privateKeyFile:
These are an optional optimization that lets raven generate 'pre-authorized' URLs for files, allowing the minoc instances that serve the file to skip the authorization step before delivering the file. The authentication information is embedded in a specially encoded URL.
These are RSA public and private key files, which can be generated using cadc-keygen or the commands below:
ssh-keygen -b 2048 -t rsa -m pkcs8 -f temp_rsa
ssh-keygen -e -m pkcs8 -f temp_rsa.pub > raven-public.key
mv temp_rsa raven-private.key
rm temp_rsa.pub
The publicKeyFile will be required by services which need to verify the pre-authorized URLs (minoc).
To test: curl https://www.example.org/raven/availability
fenwick - Metadata sync application
Use the storage-inventory/fenwick image from images.opencadc.org. In fenwick.properties:
org.opencadc.fenwick.queryService:
queryService is the resourceID for the remote luskan service -- i.e. if fenwick is running at a Storage site, queryService should refer to the remote Global site luskan; if fenwick is running at the Global site, queryService should refer to the remote Storage site luskan service. A Global site will need to run a fenwick instance for each Storage site.
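For example, a hypothetical fenwick.properties fragment for a Storage site tracking the Global site (the resourceID and JDBC URL are example values; the inventory.url key naming is noted in the FAQ below):
org.opencadc.fenwick.queryService = ivo://example.org/global/luskan
org.opencadc.fenwick.inventory.url = jdbc:postgresql://dbhost:5432/si_db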
tantar - File validation application
Use the storage-inventory/tantar image from images.opencadc.org. In tantar.properties:
org.opencadc.tantar.buckets:
A single tantar instance can cover the full bucket range (0-f); for multiple instances of tantar, you would want to configure these to operate on non-overlapping subsets of buckets (e.g. 0-7, 8-f).
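For example, a hypothetical fragment for the first of two tantar instances splitting the range (the ResolutionPolicy key is discussed in the FAQ below):
org.opencadc.tantar.buckets = 0-7
org.opencadc.tantar.policy.ResolutionPolicy = InventoryIsAlwaysRight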
critwall - File sync application
Use the storage-inventory/critwall image from images.opencadc.org. In critwall.properties:
org.opencadc.critwall.locatorService:
locatorService is the resourceID of the Global site raven service.
org.opencadc.critwall.buckets:
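For example, a hypothetical critwall.properties fragment (the raven resourceID is an example value):
org.opencadc.critwall.locatorService = ivo://example.org/global/raven
org.opencadc.critwall.buckets = 0-f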
ratik - Metadata validation
Use the storage-inventory/ratik image from images.opencadc.org. In ratik.properties:
org.opencadc.ratik.queryService:
queryService is the resourceID for the remote luskan service -- i.e. if ratik is running at a Storage site, queryService should refer to the remote Global site luskan; if ratik is running at the Global site, queryService should refer to the remote Storage site luskan service. A Global site will need to run a ratik instance for each Storage site.
org.opencadc.ratik.buckets:
All services provide an /availability endpoint which can be used for both monitoring and healthchecks.
/availability?detail=min returns no data, but the HTTP return code can be used for an efficient healthcheck.
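For example, a healthcheck that looks only at the HTTP status code (the service URL is an example value):
curl -s -o /dev/null -w '%{http_code}\n' 'https://www.example.org/minoc/availability?detail=min'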
Additional FAQ can be found here.
Note that the database configuration keys are service-specific: the key for the database URL for fenwick is org.opencadc.fenwick.inventory.url; the key for the database URL for the service minoc is org.opencadc.minoc.inventory.url. It is easy to cut and paste between config files and forget to change the key.
To remove artifacts in a namespace from a Storage site, you can use ratik -- this will remove the artifacts from the site while ensuring that at least one other copy of the artifact exists elsewhere in your Storage Inventory system.
Change the artifact-selector.sql for ratik and fenwick at the site you wish to remove the artifacts from to exclude their namespace.
ratik verifies the local site inventory against the global inventory: if the artifacts are not in the site's 'selector' list, it will remove them from the local site inventory database. fenwick syncs artifact metadata from the global inventory: if the artifacts are not in the site's 'selector' list, it will not create new local artifacts for them.
Run ratik at that site. This will only remove the artifacts from the inventory.Artifact table in the database.
Change org.opencadc.tantar.purgeNamespace for tantar at the site to include the namespace of the artifacts being removed.
Run tantar at the site. This will remove the files from storage.
Alternatively:
Change the artifact-selector.sql for ratik and fenwick at the site you wish to remove the artifacts from to exclude their namespace.
This alternative, using ringhold, is faster than waiting for ratik or fenwick. Change the artifact-deselector.sql for ringhold at the site to include the artifacts' namespace.
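As a hypothetical sketch, assuming the deselector file holds a SQL WHERE clause fragment (as the 'deselector' WHERE clause below suggests) and using an example namespace:
cat > artifact-deselector.sql <<'EOF'
WHERE uri LIKE 'cadc:TEST/%'
EOF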
ringhold will explicitly remove the artifacts identified by the 'deselector' WHERE clause from the inventory database. Run ringhold.
Change org.opencadc.tantar.purgeNamespace for tantar at the site to include the namespace of the artifacts being removed.
Run tantar at the site. This will remove the files from storage.
org.opencadc.tantar.policy.ResolutionPolicy determines whether the storage is definitive ('StorageIsAlwaysRight') or the inventory database is definitive ('InventoryIsAlwaysRight'). Usually, you will be using 'InventoryIsAlwaysRight'. The reason field in the log line gives the rationale for tantar's decision:
reason=no-matching-artifact - the file didn't match anything in the database, so should be deleted.
reason=old-storageLocation - the file didn't match the storage ID of the artifact, and the correct file with the matching storage ID was available.
war-rename.conf
The SI services (baldur, luskan, minoc, raven) expect to be available via a URL like https://example.org/minoc -- i.e. the service name immediately follows the domain name, with a path of '/', and the name of the service in the URL is unchanged.
To change the name of a service in the URL, create a war-rename.conf file in the service's /config directory.
The war-rename.conf file contains a simple Unix shell-like 'mv' command to rename the service war file, e.g. mv minoc.war newname.war -- in this example, the minoc service would be available at https://example.org/newname.