Skip to content

CANFAR Storage Systems

Master CANFAR's storage systems for efficient data management

🎯 What You'll Learn

By the end of this guide, you'll understand: - The different storage systems available on CANFAR - When and how to use each storage type for your research - Best practices for data management, transfer, and backup - How to optimize your workflow for performance and data safety

CANFAR provides multiple storage systems optimized for different stages of your research workflow. Understanding when and how to use each storage type is crucial for efficient data analysis and collaboration.

πŸ“Š Types Comparison

Storage Mount Path Speed Persistence Backup Quota Best For
ARC Projects /arc/projects/group/ Fast SSD βœ… Permanent βœ… Daily snapshots Project-based Active research, shared data
ARC Home /arc/home/username/ Fast SSD βœ… Permanent βœ… Daily snapshots 10GB default Personal configs, keys
Scratch /scratch/ Fastest NVMe ❌ Wiped at session end ❌ No backup Unlimited Temporary processing
VOSpace vos:username/ Medium βœ… Permanent βœ… Geo-redundant User/project based Archives, public data

πŸ—ΊοΈ Storage Lifecycle Overview

CANFAR's storage architecture is designed around the astronomy research lifecycle:

graph LR
    Archive["πŸ“¦ External Archives<br/>ALMA, HST, etc."] 
    Download["⬇️ Download"]
    Scratch["⚑ Scratch Storage<br/>Fast processing"]
    Process["πŸ”„ Data Processing"]
    ARC["πŸ“ ARC Projects<br/>Shared results"]
    VOSpace["☁️ VOSpace<br/>Long-term archive"]

    Archive --> Download
    Download --> Scratch
    Scratch --> Process
    Process --> ARC
    ARC --> VOSpace
    ARC -.-> |Backup| Process

πŸ“ ARC Storage

ARC (Advanced Research Computing) storage provides high-performance, persistent storage for active research.

/arc/projects/groupname/ - Shared Research Storage

When to Use ARC Projects

  • Raw and processed datasets
  • Analysis scripts and notebooks
  • Results and publications
  • Shared team resources
  • Collaborative workflows

πŸ”§ Features:

  • Shared access - All group members can read/write
  • Fast SSD storage - Optimized for data analysis
  • Daily backups - 30-day retention policy
  • ACL support - Fine-grained permission control

πŸ“‚ Recommended Structure:

/arc/projects/myproject/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/              # Original observational data
β”‚   β”œβ”€β”€ processed/        # Calibrated/reduced data  
β”‚   β”œβ”€β”€ catalogs/         # Reference catalogs
β”‚   └── simulations/      # Synthetic datasets
β”œβ”€β”€ code/
β”‚   β”œβ”€β”€ pipelines/        # Data processing workflows
β”‚   β”œβ”€β”€ analysis/         # Analysis scripts
β”‚   β”œβ”€β”€ notebooks/        # Jupyter notebooks
β”‚   └── tools/            # Custom utilities
β”œβ”€β”€ results/
β”‚   β”œβ”€β”€ plots/            # Figures and visualizations
β”‚   β”œβ”€β”€ tables/           # Output measurements
β”‚   β”œβ”€β”€ papers/           # Manuscripts and drafts
β”‚   └── presentations/    # Conference materials
└── docs/
    β”œβ”€β”€ README.md         # Project documentation
    β”œβ”€β”€ data_notes.md     # Dataset descriptions
    └── procedures.md     # Analysis procedures

/arc/home/username/ - Personal Space

When to Use ARC Home

  • Personal configuration files (.bashrc, .jupyter/)
  • SSH keys and authentication credentials
  • Personal scripts and utilities
  • Small reference files

⚠️ Limitations:

  • 10GB default quota (contact support for increases)
  • Personal access only (not shared)
  • Not suitable for large datasets

Managing ARC Storage

Check Usage and Quotas

# Check project storage usage
df -h /arc/projects/myproject

# Detailed usage breakdown
du -sh /arc/projects/myproject/*

# Check home directory usage
du -sh /arc/home/$USER/*

# Check available space
df -h /arc

Organizing Data

# Create organized directory structure
mkdir -p /arc/projects/myproject/{data/{raw,processed,catalogs},code,results,docs}

# Set group permissions for collaboration
chmod -R g+rw /arc/projects/myproject/
chmod g+s /arc/projects/myproject/  # Inherit group ownership

Backup and Recovery

ARC Backup

ARC storage is automatically backed up daily with a 30-day retention policy. You can also restore files from snapshots if needed.

# List available snapshots (if enabled)
ls /arc/projects/myproject/.snapshots/

# Restore from snapshot
cp /arc/projects/myproject/.snapshots/daily.2024-03-15/important_file.fits \
   /arc/projects/myproject/restored_file.fits

⚑ Scratch Storage

Scratch provides the fastest storage available on CANFAR, but files are temporary.

Important: Scratch Storage Lifecycle

Scratch storage is wiped at the end of each session, not nightly as some older documentation stated. When your interactive session ends or your batch job completes, all files in /scratch/ are permanently deleted.

When to Use Scratch

βœ… Excellent for: - Large intermediate files during processing - Temporary downloads before organizing in ARC - High I/O operations requiring maximum speed - Uncompressing large archives - Sorting and filtering large datasets

❌ Never use for: - Important results (will be lost!) - Files you need to keep between sessions - Shared data (only accessible within your session)

Scratch Lifecycle

sequenceDiagram
    participant User
    participant Session as Session/Job
    participant Scratch as /scratch/

    User->>Session: Start session
    Session->>Scratch: Create empty /scratch/ directory
    User->>Scratch: Work with temporary files
    Note over Scratch: Fast NVMe storage
    User->>Session: End session
    Session->>Scratch: DELETE ALL FILES
    Note over Scratch: Directory wiped clean

Scratch Best Practices

# 1. Download large files to scratch first
cd /scratch
wget https://archive.alma.cl/large_dataset.tar.gz

# 2. Process immediately
tar -xzf large_dataset.tar.gz
casa --nologger -c "process_data.py"

# 3. Save results to permanent storage
cp processed_results.fits /arc/projects/myproject/data/processed/

# 4. Clean up isn't necessary (done automatically)
# but good practice during long sessions
rm large_dataset.tar.gz intermediate_*.fits

Multi-step Processing Workflow

#!/bin/bash
# Example: ALMA data reduction workflow using scratch

# Step 1: Download to scratch
cd /scratch
almaget 2019.1.00123.S

# Step 2: Process with CASA
casa --nologger --agg -c """
# CASA script here
execfile('/arc/projects/myproject/code/reduction_script.py')
"""

# Step 3: Save important results
mkdir -p /arc/projects/myproject/data/2019.1.00123.S/
cp *.image.fits /arc/projects/myproject/data/2019.1.00123.S/
cp *.uvfits /arc/projects/myproject/data/2019.1.00123.S/

# Step 4: Create processing log
echo "Processed $(date): 2019.1.00123.S" >> /arc/projects/myproject/processing_log.txt

☁️ VOSpace

VOSpace provides web-accessible, long-term archive storage based on IVOA standards.

When to Use VOSpace

  • Archives and public data
  • Long-term preservation
  • Sharing data with external collaborators
  • Metadata-rich datasets

πŸ”§ Features:

  • Web-based access - Upload/download via browser or command line
  • Metadata support - Store astronomical metadata with files
  • Version control - Track changes to datasets
  • Sharing controls - Fine-grained access permissions
  • Geographic redundancy - Multiple backup locations

⚠️ Considerations:

  • Slower access than ARC storage (network-based)
  • Better for archives than active analysis
  • Command-line tools required for advanced features

VOSpace vs ARC Comparison

Use Case VOSpace ARC Projects
Active analysis ❌ Too slow βœ… Optimized
Data sharing βœ… Web interface ⚠️ Requires group membership
Public releases βœ… Public URLs ❌ Access controlled
Long-term preservation βœ… Geo-redundant βœ… Daily backups
Large file processing ❌ Network overhead βœ… Direct access

Using VOSpace

Web Interface

Access VOSpace through the CANFAR portal: πŸ”— VOSpace File Manager

Command Line Tools

# Install VOSpace tools
pip install vostools

# List VOSpace contents
vls vos:myproject

# Upload file
vcp local_file.fits vos:myproject/

# Download file  
vcp vos:myproject/data.fits ./

# Create directory
vmkdir vos:myproject/results

# Set permissions
vchmod o+r vos:myproject/public_data.fits  # Make publicly readable

VOSpace Python API

import vos

# Create client
client = vos.Client()

# Upload file with metadata
client.copy("local_file.fits", "vos:myproject/survey_data.fits")

# Set metadata
node = client.get_node("vos:myproject/survey_data.fits")
node.props["TELESCOPE"] = "ALMA"
node.props["OBJECT"] = "NGC1365"
client.update(node)

# Download with progress
client.copy("vos:myproject/large_file.fits", "local_copy.fits", send_md5=True)

πŸ”„ Data Transfers

Data Transfer Best Practice

Always move important results from /scratch/ to /arc/projects/ or VOSpace before ending your session. Use the right tool for your file size and workflow.

Transfer Strategies by Data Size

Small Files (<1GB)

# Direct copy (fastest for small files)
cp /scratch/result.fits /arc/projects/myproject/results/

# VOSpace upload
vcp /arc/projects/myproject/final_catalog.fits vos:myproject/

Medium Files (1-100GB)

# Use rsync for reliability
rsync -av --progress /scratch/large_dataset/ /arc/projects/myproject/data/

# VOSpace with compression
vcp --enable-compression /arc/projects/myproject/datacube.fits vos:myproject/

Large Files (>100GB)

# Process in chunks
for file in /scratch/survey_*.fits; do
    # Process individual file
    process_file.py "$file" 
    # Save results immediately
    cp "${file%.fits}_processed.fits" /arc/projects/myproject/processed/
    # Remove processed input to save space
    rm "$file"
done

SSHFS Setup for External Access

Mount CANFAR storage on your local computer:

# Install SSHFS (macOS with Homebrew)
brew install --cask macfuse
brew install sshfs

# Create mount point
mkdir ~/canfar

# Mount ARC storage
sshfs username@ws-uv.canfar.net:/arc/projects/myproject ~/canfar

# Work with files locally
ls ~/canfar
cp ~/local_analysis.py ~/canfar/code/

# Unmount when done
umount ~/canfar

SSHFS on Different Platforms

# Install dependencies
brew install --cask macfuse
brew install sshfs

# Mount
sshfs username@ws-uv.canfar.net:/arc/projects/myproject ~/canfar
# Install SSHFS
sudo apt install sshfs  # Ubuntu/Debian
# or
sudo yum install sshfs  # CentOS/RHEL

# Mount
sshfs username@ws-uv.canfar.net:/arc/projects/myproject ~/canfar
# Install WinFsp and SSHFS-Win
# Download from: https://github.com/billziss-gh/sshfs-win

# Mount using SSHFS-Win GUI or command line

πŸ› οΈ Advanced Storage Operations

Full VOSpace API Usage

Metadata Management

# Set custom metadata
vattr vos:myproject/observation.fits TELESCOPE ALMA
vattr vos:myproject/observation.fits OBJECT "NGC 1365"
vattr vos:myproject/observation.fits DATE-OBS "2024-03-15"

# View metadata
vattr vos:myproject/observation.fits

Advanced Permissions

# Make file publicly readable
vchmod o+r vos:myproject/public_catalog.fits

# Grant read access to specific group
vchmod g+r:external-collaborators vos:myproject/shared_data.fits

# Set up public directory
vmkdir vos:myproject/public
vchmod o+r vos:myproject/public

Cutout Services

Access subsections of large files without downloading the entire dataset:

import requests

# Get cutout from FITS file in VOSpace
cutout_url = "https://ws-cadc.canfar.net/vospace/data/myproject/large_image.fits"
params = {"cutout": "[1:100,1:100]", "format": "fits"}  # Section to extract

response = requests.get(cutout_url, params=params)
with open("cutout.fits", "wb") as f:
    f.write(response.content)

Automated Data Workflows

#!/usr/bin/env python
"""
Automated data processing workflow using multiple storage systems
"""
import os
import shutil
import vos
from pathlib import Path


def process_dataset(dataset_id):
    """Process a dataset using optimal storage strategy"""

    # 1. Download to scratch for fast processing
    scratch_dir = Path(f"/scratch/{dataset_id}")
    scratch_dir.mkdir(exist_ok=True)

    # Download from VOSpace to scratch
    client = vos.Client()
    client.copy(f"vos:archive/{dataset_id}.fits", str(scratch_dir / "raw_data.fits"))

    # 2. Process data (using scratch for speed)
    os.chdir(scratch_dir)
    # ... processing code here ...

    # 3. Save results to ARC projects
    results_dir = Path(f"/arc/projects/myproject/results/{dataset_id}")
    results_dir.mkdir(parents=True, exist_ok=True)

    # Copy important results
    shutil.copy("processed_image.fits", results_dir)
    shutil.copy("measurements.csv", results_dir)
    shutil.copy("processing.log", results_dir)

    # 4. Archive final products to VOSpace
    client.copy(
        str(results_dir / "processed_image.fits"),
        f"vos:myproject/processed/{dataset_id}_final.fits",
    )

    # Scratch cleanup happens automatically at session end
    print(f"Processed {dataset_id} successfully")


# Process multiple datasets
for dataset in ["obs001", "obs002", "obs003"]:
    process_dataset(dataset)

πŸ”— What's Next?

Now that you understand CANFAR's storage systems:


Storage Strategy Summary

Golden Rule: Use /scratch/ for fast temporary work, save everything important to /arc/projects/, and archive final results in VOSpace. Plan your data workflow around these three storage tiers for optimal performance and data safety.


Last update: 2025-08-07
Created: 2025-08-07