# CANFAR Storage Systems

Master CANFAR's storage systems for efficient data management.

## 🎯 What You'll Learn

By the end of this guide, you'll understand:

- The different storage systems available on CANFAR
- When and how to use each storage type for your research
- Best practices for data management, transfer, and backup
- How to optimize your workflow for performance and data safety
CANFAR provides multiple storage systems optimized for different stages of your research workflow. Understanding when and how to use each storage type is crucial for efficient data analysis and collaboration.
## 📊 Types Comparison

| Storage | Mount Path | Speed | Persistence | Backup | Quota | Best For |
|---|---|---|---|---|---|---|
| ARC Projects | `/arc/projects/group/` | Fast SSD | ✅ Permanent | ✅ Daily snapshots | Project-based | Active research, shared data |
| ARC Home | `/arc/home/username/` | Fast SSD | ✅ Permanent | ✅ Daily snapshots | 10 GB default | Personal configs, keys |
| Scratch | `/scratch/` | Fastest NVMe | ❌ Wiped at session end | ❌ No backup | Unlimited | Temporary processing |
| VOSpace | `vos:username/` | Medium | ✅ Permanent | ✅ Geo-redundant | User/project based | Archives, public data |
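Inside a session you can confirm how much room each mount has before starting a large job. A minimal sketch, assuming the standard mount paths from the table above (the project name `myproject` is a placeholder):

```python
import shutil

# Mount paths from the comparison table; adjust the project name to yours.
MOUNTS = ["/arc/projects/myproject", "/arc/home", "/scratch"]

def usage_gb(path):
    """Return total/used/free space in GB for the filesystem holding `path`."""
    total, used, free = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {"total": total / gb, "used": used / gb, "free": free / gb}

for mount in MOUNTS:
    try:
        stats = usage_gb(mount)
        print(f"{mount}: {stats['free']:.1f} GB free of {stats['total']:.1f} GB")
    except OSError:
        print(f"{mount}: not mounted in this session")
```

This is the programmatic equivalent of the `df -h` checks shown later in this guide.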
## 🗺️ Storage Lifecycle Overview

CANFAR's storage architecture is designed around the astronomy research lifecycle:

```mermaid
graph LR
    Archive["📦 External Archives<br/>ALMA, HST, etc."]
    Download["⬇️ Download"]
    Scratch["⚡ Scratch Storage<br/>Fast processing"]
    Process["🔄 Data Processing"]
    ARC["📁 ARC Projects<br/>Shared results"]
    VOSpace["☁️ VOSpace<br/>Long-term archive"]
    Archive --> Download
    Download --> Scratch
    Scratch --> Process
    Process --> ARC
    ARC --> VOSpace
    ARC -.-> |Backup| Process
```
## 📁 ARC Storage

ARC (Advanced Research Computing) storage provides high-performance, persistent storage for active research.

### `/arc/projects/groupname/` - Shared Research Storage

**When to Use ARC Projects**

- Raw and processed datasets
- Analysis scripts and notebooks
- Results and publications
- Shared team resources
- Collaborative workflows

🔧 **Features:**

- **Shared access** - All group members can read/write
- **Fast SSD storage** - Optimized for data analysis
- **Daily backups** - 30-day retention policy
- **ACL support** - Fine-grained permission control
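A quick way to check whether a shared directory is actually set up for collaboration is to inspect its permission bits. A minimal sketch (the project path is a placeholder; this checks the basic mode bits, not full ACLs):

```python
import stat
from pathlib import Path

def collaboration_flags(path):
    """Check group-write and setgid bits on a shared project directory."""
    mode = Path(path).stat().st_mode
    return {
        "group_writable": bool(mode & stat.S_IWGRP),  # group members can write
        "setgid": bool(mode & stat.S_ISGID),          # new files inherit the group
    }

project = Path("/arc/projects/myproject")  # placeholder path
if project.exists():
    print(collaboration_flags(project))
```

Both flags should be `True` after the `chmod g+rw` / `chmod g+s` setup shown in the "Organizing Data" section below.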
📁 **Recommended Structure:**

```text
/arc/projects/myproject/
├── data/
│   ├── raw/             # Original observational data
│   ├── processed/       # Calibrated/reduced data
│   ├── catalogs/        # Reference catalogs
│   └── simulations/     # Synthetic datasets
├── code/
│   ├── pipelines/       # Data processing workflows
│   ├── analysis/        # Analysis scripts
│   ├── notebooks/       # Jupyter notebooks
│   └── tools/           # Custom utilities
├── results/
│   ├── plots/           # Figures and visualizations
│   ├── tables/          # Output measurements
│   ├── papers/          # Manuscripts and drafts
│   └── presentations/   # Conference materials
└── docs/
    ├── README.md        # Project documentation
    ├── data_notes.md    # Dataset descriptions
    └── procedures.md    # Analysis procedures
```
### `/arc/home/username/` - Personal Space

**When to Use ARC Home**

- Personal configuration files (`.bashrc`, `.jupyter/`)
- SSH keys and authentication credentials
- Personal scripts and utilities
- Small reference files

⚠️ **Limitations:**

- 10 GB default quota (contact support for increases)
- Personal access only (not shared)
- Not suitable for large datasets
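Because the home quota is small, it is worth watching how close you are to it. A minimal sketch that sums file sizes under a directory and reports usage against the 10 GB default (the quota value is the documented default, not queried from the system):

```python
import os
from pathlib import Path

QUOTA_GB = 10  # default ARC home quota noted above

def dir_size_bytes(root):
    """Total size of regular files under `root` (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fp = Path(dirpath) / name
            if fp.is_file() and not fp.is_symlink():
                total += fp.stat().st_size
    return total

def quota_report(root, quota_gb=QUOTA_GB):
    """Return (used_gb, percent_of_quota) for `root`."""
    used_gb = dir_size_bytes(root) / 1024 ** 3
    return used_gb, 100 * used_gb / quota_gb

used, pct = quota_report(os.path.expanduser("~"))
print(f"Home usage: {used:.2f} GB ({pct:.0f}% of {QUOTA_GB} GB quota)")
```

For a quick interactive check, `du -sh /arc/home/$USER/*` (shown below) gives the same information.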
### Managing ARC Storage

#### Check Usage and Quotas

```bash
# Check project storage usage
df -h /arc/projects/myproject

# Detailed usage breakdown
du -sh /arc/projects/myproject/*

# Check home directory usage
du -sh /arc/home/$USER/*

# Check available space
df -h /arc
```
#### Organizing Data

```bash
# Create an organized directory structure
mkdir -p /arc/projects/myproject/{data/{raw,processed,catalogs},code,results,docs}

# Set group permissions for collaboration
chmod -R g+rw /arc/projects/myproject/
chmod g+s /arc/projects/myproject/   # New files inherit group ownership
```
#### Backup and Recovery

**ARC Backup**

ARC storage is automatically backed up daily with a 30-day retention policy. You can also restore files from snapshots if needed.

```bash
# List available snapshots (if enabled)
ls /arc/projects/myproject/.snapshots/

# Restore from a snapshot
cp /arc/projects/myproject/.snapshots/daily.2024-03-15/important_file.fits \
   /arc/projects/myproject/restored_file.fits
```
## ⚡ Scratch Storage

Scratch provides the fastest storage available on CANFAR, but files are temporary.

**Important: Scratch Storage Lifecycle**

Scratch storage is wiped at the end of each session, not nightly as some older documentation stated. When your interactive session ends or your batch job completes, all files in `/scratch/` are permanently deleted.

### When to Use Scratch

✅ **Excellent for:**

- Large intermediate files during processing
- Temporary downloads before organizing in ARC
- High I/O operations requiring maximum speed
- Uncompressing large archives
- Sorting and filtering large datasets

❌ **Never use for:**

- Important results (they will be lost!)
- Files you need to keep between sessions
- Shared data (scratch is only accessible within your own session)
### Scratch Lifecycle

```mermaid
sequenceDiagram
    participant User
    participant Session as Session/Job
    participant Scratch as /scratch/
    User->>Session: Start session
    Session->>Scratch: Create empty /scratch/ directory
    User->>Scratch: Work with temporary files
    Note over Scratch: Fast NVMe storage
    User->>Session: End session
    Session->>Scratch: DELETE ALL FILES
    Note over Scratch: Directory wiped clean
```
### Scratch Best Practices

```bash
# 1. Download large files to scratch first
cd /scratch
wget https://archive.alma.cl/large_dataset.tar.gz

# 2. Process immediately
tar -xzf large_dataset.tar.gz
casa --nologger -c "process_data.py"

# 3. Save results to permanent storage
cp processed_results.fits /arc/projects/myproject/data/processed/

# 4. Cleanup isn't strictly necessary (it happens automatically),
#    but it is good practice during long sessions
rm large_dataset.tar.gz intermediate_*.fits
```
### Multi-step Processing Workflow

```bash
#!/bin/bash
# Example: ALMA data reduction workflow using scratch

# Step 1: Download to scratch
cd /scratch
almaget 2019.1.00123.S

# Step 2: Process with CASA
casa --nologger --agg -c "execfile('/arc/projects/myproject/code/reduction_script.py')"

# Step 3: Save important results
mkdir -p /arc/projects/myproject/data/2019.1.00123.S/
cp *.image.fits /arc/projects/myproject/data/2019.1.00123.S/
cp *.uvfits /arc/projects/myproject/data/2019.1.00123.S/

# Step 4: Append to the processing log
echo "Processed $(date): 2019.1.00123.S" >> /arc/projects/myproject/processing_log.txt
```
## ☁️ VOSpace

VOSpace provides web-accessible, long-term archive storage based on IVOA standards.

**When to Use VOSpace**

- Archives and public data
- Long-term preservation
- Sharing data with external collaborators
- Metadata-rich datasets

🔧 **Features:**

- **Web-based access** - Upload/download via browser or command line
- **Metadata support** - Store astronomical metadata with files
- **Version control** - Track changes to datasets
- **Sharing controls** - Fine-grained access permissions
- **Geographic redundancy** - Multiple backup locations

⚠️ **Considerations:**

- Slower access than ARC storage (network-based)
- Better for archives than active analysis
- Command-line tools required for advanced features
### VOSpace vs ARC Comparison

| Use Case | VOSpace | ARC Projects |
|---|---|---|
| Active analysis | ❌ Too slow | ✅ Optimized |
| Data sharing | ✅ Web interface | ⚠️ Requires group membership |
| Public releases | ✅ Public URLs | ❌ Access controlled |
| Long-term preservation | ✅ Geo-redundant | ✅ Daily backups |
| Large file processing | ❌ Network overhead | ✅ Direct access |
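The comparison above boils down to a simple decision rule, which can be sketched as a small helper (the categories and return strings are illustrative shorthand, not CANFAR API values):

```python
def suggest_tier(active_analysis=False, share_public=False, temporary=False):
    """Suggest a storage tier following the VOSpace vs ARC comparison table."""
    if temporary:
        return "/scratch/"          # fastest, wiped at session end
    if share_public:
        return "vos: (VOSpace)"     # public URLs, web interface
    if active_analysis:
        return "/arc/projects/"     # direct, fast access for analysis
    return "vos: (VOSpace)"         # default: long-term preservation

print(suggest_tier(temporary=True))
print(suggest_tier(active_analysis=True))
```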
### Using VOSpace

#### Web Interface

Access VOSpace through the CANFAR portal: 🔗 VOSpace File Manager

#### Command Line Tools

```bash
# Install the VOSpace tools
pip install vos

# List VOSpace contents
vls vos:myproject

# Upload a file
vcp local_file.fits vos:myproject/

# Download a file
vcp vos:myproject/data.fits ./

# Create a directory
vmkdir vos:myproject/results

# Set permissions
vchmod o+r vos:myproject/public_data.fits   # Make publicly readable
```
#### VOSpace Python API

```python
import vos

# Create a client
client = vos.Client()

# Upload a file
client.copy("local_file.fits", "vos:myproject/survey_data.fits")

# Set metadata on the uploaded node
node = client.get_node("vos:myproject/survey_data.fits")
node.props["TELESCOPE"] = "ALMA"
node.props["OBJECT"] = "NGC1365"
client.update(node)

# Download with MD5 verification
client.copy("vos:myproject/large_file.fits", "local_copy.fits", send_md5=True)
```
## 🔄 Data Transfers

**Data Transfer Best Practice**

Always move important results from `/scratch/` to `/arc/projects/` or VOSpace before ending your session. Use the right tool for your file size and workflow.
### Transfer Strategies by Data Size

#### Small Files (<1 GB)

```bash
# Direct copy (fastest for small files)
cp /scratch/result.fits /arc/projects/myproject/results/

# VOSpace upload
vcp /arc/projects/myproject/final_catalog.fits vos:myproject/
```

#### Medium Files (1-100 GB)

```bash
# Use rsync for reliability
rsync -av --progress /scratch/large_dataset/ /arc/projects/myproject/data/

# VOSpace upload
vcp /arc/projects/myproject/datacube.fits vos:myproject/
```
#### Large Files (>100 GB)

```bash
# Process in chunks to avoid filling scratch
for file in /scratch/survey_*.fits; do
    # Process an individual file
    process_file.py "$file"
    # Save results immediately
    cp "${file%.fits}_processed.fits" /arc/projects/myproject/processed/
    # Remove the processed input to save space
    rm "$file"
done
```
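The chunked loop above can also be written in Python, which makes it easier to add error handling in a pipeline. A minimal sketch under the same assumptions (`reduce_file` is a hypothetical stand-in for your actual processing step; the paths are placeholders):

```python
import shutil
from pathlib import Path

SCRATCH = Path("/scratch")                          # placeholder paths
DEST = Path("/arc/projects/myproject/processed")

def reduce_file(src, dst):
    """Hypothetical reduction step: replace with your real processing."""
    shutil.copy2(src, dst)

def process_in_chunks(pattern="survey_*.fits", scratch=SCRATCH, dest=DEST):
    """Process one file at a time: reduce, save to ARC, then free scratch."""
    dest.mkdir(parents=True, exist_ok=True)
    for src in sorted(scratch.glob(pattern)):
        processed = src.with_name(src.stem + "_processed.fits")
        reduce_file(src, processed)
        shutil.move(str(processed), dest / processed.name)  # save result first
        src.unlink()                                        # then free scratch space
```

Saving each result before deleting its input means an interrupted run never loses finished work.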
### SSHFS Setup for External Access

Mount CANFAR storage on your local computer:

```bash
# Install SSHFS (macOS with Homebrew)
brew install --cask macfuse
brew install sshfs

# Create a mount point
mkdir ~/canfar

# Mount ARC storage
sshfs username@ws-uv.canfar.net:/arc/projects/myproject ~/canfar

# Work with files locally
ls ~/canfar
cp ~/local_analysis.py ~/canfar/code/

# Unmount when done
umount ~/canfar
```
## 🛠️ Advanced Storage Operations

### Full VOSpace API Usage

#### Metadata Management

```bash
# Set custom metadata
vattr vos:myproject/observation.fits TELESCOPE ALMA
vattr vos:myproject/observation.fits OBJECT "NGC 1365"
vattr vos:myproject/observation.fits DATE-OBS "2024-03-15"

# View metadata
vattr vos:myproject/observation.fits
```
#### Advanced Permissions

```bash
# Make a file publicly readable
vchmod o+r vos:myproject/public_catalog.fits

# Grant read access to a specific group
vchmod g+r:external-collaborators vos:myproject/shared_data.fits

# Set up a public directory
vmkdir vos:myproject/public
vchmod o+r vos:myproject/public
```
#### Cutout Services

Access subsections of large files without downloading the entire dataset:

```python
import requests

# Request a cutout of a FITS file stored in VOSpace
cutout_url = "https://ws-cadc.canfar.net/vospace/data/myproject/large_image.fits"
params = {"cutout": "[1:100,1:100]", "format": "fits"}  # Pixel section to extract

response = requests.get(cutout_url, params=params)
with open("cutout.fits", "wb") as f:
    f.write(response.content)
```
Automated Data Workflows¶
#!/usr/bin/env python
"""
Automated data processing workflow using multiple storage systems
"""
import os
import shutil
import vos
from pathlib import Path
def process_dataset(dataset_id):
"""Process a dataset using optimal storage strategy"""
# 1. Download to scratch for fast processing
scratch_dir = Path(f"/scratch/{dataset_id}")
scratch_dir.mkdir(exist_ok=True)
# Download from VOSpace to scratch
client = vos.Client()
client.copy(f"vos:archive/{dataset_id}.fits", str(scratch_dir / "raw_data.fits"))
# 2. Process data (using scratch for speed)
os.chdir(scratch_dir)
# ... processing code here ...
# 3. Save results to ARC projects
results_dir = Path(f"/arc/projects/myproject/results/{dataset_id}")
results_dir.mkdir(parents=True, exist_ok=True)
# Copy important results
shutil.copy("processed_image.fits", results_dir)
shutil.copy("measurements.csv", results_dir)
shutil.copy("processing.log", results_dir)
# 4. Archive final products to VOSpace
client.copy(
str(results_dir / "processed_image.fits"),
f"vos:myproject/processed/{dataset_id}_final.fits",
)
# Scratch cleanup happens automatically at session end
print(f"Processed {dataset_id} successfully")
# Process multiple datasets
for dataset in ["obs001", "obs002", "obs003"]:
process_dataset(dataset)
## 🚀 What's Next?

Now that you understand CANFAR's storage systems:

- **VOSpace API Guide** → Advanced programmatic access and detailed transfer methods
- **Interactive Sessions** → Access storage from sessions
- **Batch Jobs** → Automated storage workflows
- **Container Guide** → Storage access in containers

**Storage Strategy Summary**

**Golden Rule**: Use `/scratch/` for fast temporary work, save everything important to `/arc/projects/`, and archive final results in VOSpace. Plan your data workflow around these three storage tiers for optimal performance and data safety.

Created: 2025-08-07