Data Transfers¶
Moving data between CANFAR storage systems, external sources, and your local computer.
🎯 Transfer Methods Overview
Efficient data movement strategies:
- Web Interfaces: Simple uploads and downloads for small files
- Command-Line Tools: Efficient transfers for large datasets
- Automated Workflows: Scripted transfers and synchronisation
- Performance Optimisation: Choosing the right method for your data size
Efficient data transfer is essential for astronomy workflows. CANFAR provides multiple transfer methods optimised for different scenarios, from small file uploads to large dataset synchronisation.
🔄 Transfer Overview¶
Transfer Types by Method¶
| Method | Best For | Speed | Complexity | Interactive | Automated |
|---|---|---|---|---|---|
| Web Upload/Download | Small files (<1GB) | Slow | Simple | ✅ | ❌ |
| Direct URLs | Medium files, scripting | Medium | Simple | ⚠️ | ✅ |
| VOSpace CLI | All sizes, Vault access | Medium | Medium | ✅ | ✅ |
| SSHFS Mount | Local file operations | Medium | Medium | ✅ | ⚠️ |
| rsync via SSHFS | Large datasets, sync | Fast | Advanced | ⚠️ | ✅ |
Storage System Access¶
| Source → Destination | Method | Command Example |
|---|---|---|
| Local → ARC Projects | SSHFS, Direct URL, VOSpace | `vcp file.fits arc:projects/[project]/` |
| Local → Vault | VOSpace CLI, Web | `vcp file.fits vos:[user]/` |
| Local → Scratch | Only during sessions | `cp file.fits /scratch/` (within a session) |
| ARC → Vault | VOSpace CLI | `vcp /arc/projects/[project]/file.fits vos:[user]/` |
| Vault → ARC | VOSpace CLI | `vcp vos:[user]/file.fits /arc/projects/[project]/` |
| Scratch ↔ ARC | Direct copy | `cp /scratch/file.fits /arc/projects/[project]/` |
📤 Upload Methods¶
Small Files (<1GB): Web Interface¶
ARC Projects and Home¶
- Navigate to storage: ARC File Manager
- Select destination: Choose your home or project directory
- Upload files: Click "Add" → "Upload Files"
- Select files: Choose files from your computer
- Confirm upload: Click "Upload" then "OK"
Note: in a Notebook session you can also use the JupyterLab upload button.
Vault (VOSpace)¶
- Navigate to Vault: VOSpace File Manager
- Select destination: Browse to your space
- Upload files: Same process as ARC storage
- Set permissions: Right-click → Properties to set sharing permissions
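Permissions can also be set from the command line with `vchmod` (part of the vos tools); a minimal sketch, where `MYGROUP` is a placeholder CADC group name:

```bash
# Grant read access to a CADC group on a Vault file
vchmod g+r vos:[user]/data/shared_file.fits MYGROUP

# Make a file publicly readable
vchmod o+r vos:[user]/data/public_file.fits
```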
Medium Files (1-100GB): Command Line¶
Using Direct URLs (ARC only)¶
# Authenticate first
cadc-get-cert -u [user]
# Upload to ARC Home
curl -E ~/.ssl/cadcproxy.pem \
-T myfile.fits \
https://ws-uv.canfar.net/arc/files/home/[user]/myfile.fits
# Upload to ARC Projects
curl -E ~/.ssl/cadcproxy.pem \
-T myfile.fits \
https://ws-uv.canfar.net/arc/files/projects/[project]/myfile.fits
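The same `curl` pattern extends to several files with a shell loop; a minimal sketch, assuming the FITS files sit in the current directory:

```bash
# Upload every FITS file in the current directory to ARC Projects
for f in *.fits; do
    curl -E ~/.ssl/cadcproxy.pem \
        -T "$f" \
        "https://ws-uv.canfar.net/arc/files/projects/[project]/$f"
done
```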
Using VOSpace CLI¶
# Install VOS tools (if not already available)
pip install vos
# Authenticate
cadc-get-cert -u [user]
# Upload to Vault
vcp myfile.fits vos:[user]/data/
# Upload to ARC via VOSpace API
vcp myfile.fits arc:projects/[project]/data/
# Upload with progress monitoring
vcp --verbose myfile.fits vos:[user]/large_files/
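If the destination directory does not yet exist, create it first with `vmkdir`:

```bash
# Create the destination directory before copying into it
vmkdir vos:[user]/large_files
vcp --verbose myfile.fits vos:[user]/large_files/
```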
Large Files (>100GB): Advanced Methods¶
SSHFS Mount + rsync¶
# 1. Mount CANFAR storage locally
mkdir ~/canfar_mount
sshfs -p 64022 [user]@ws-uv.canfar.net:/ ~/canfar_mount
# 2. Sync large datasets with rsync
rsync -avzP --partial \
./large_dataset/ \
~/canfar_mount/arc/projects/[project]/data/
# 3. Unmount when complete
umount ~/canfar_mount
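For very long transfers over an unreliable connection, the rsync step can simply be re-run until it succeeds: `--partial` keeps incomplete files, so each retry resumes where the last one stopped. A minimal retry wrapper:

```bash
# Re-run rsync until it exits cleanly; --partial resumes interrupted files
until rsync -avzP --partial \
    ./large_dataset/ \
    ~/canfar_mount/arc/projects/[project]/data/; do
    echo "rsync interrupted; retrying in 60 seconds..."
    sleep 60
done
```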
VOSpace Bulk Transfer¶
# Sync entire directories
vsync ./local_data/ vos:[user]/backup/
# Parallel transfers (faster for many files)
vcp --nstreams=4 large_file.tar vos:[user]/archives/
📥 Download Methods¶
From ARC Storage¶
Web Interface¶
- Navigate: ARC File Manager
- Select files: Check boxes next to desired files
- Download options:
    - ZIP: Single archive (recommended for multiple files)
    - URL List: Generate download links for scripting (see the sketch below)
    - HTML List: Individual download links
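The URL list option pairs well with `curl` for scripted downloads; a minimal sketch, assuming the generated links were saved to a file named `url_list.txt` (the filename is a placeholder):

```bash
# Download each link from a saved URL list, authenticating with the proxy certificate
while read -r url; do
    curl -E ~/.ssl/cadcproxy.pem -O "$url"
done < url_list.txt
```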
Command Line¶
# Direct URL download
curl -E ~/.ssl/cadcproxy.pem \
https://ws-uv.canfar.net/arc/files/home/[user]/myfile.fits \
-o myfile.fits
# Via VOSpace API
vcp arc:home/[user]/myfile.fits ./
# Multiple files with wildcards
vcp "arc:projects/[project]/data/*.fits" ./local_data/
From Vault (VOSpace)¶
Command Line¶
# Single file
vcp vos:[user]/data.fits ./
# Directory with all contents
vcp vos:[user]/survey_data/ ./local_survey/
Python API¶
import vos
client = vos.Client()
# Download single file
client.copy("vos:[user]/data.fits", "./local_data.fits")
# Download with progress callback
def progress_callback(bytes_transferred, total_bytes):
percent = (bytes_transferred / total_bytes) * 100
print(f"Progress: {percent:.1f}%")
client.copy("vos:[user]/large_file.fits",
"./large_file.fits",
callback=progress_callback)
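The client can also list remote directories, which makes it easy to fetch only files missing locally; a minimal sketch using `Client.listdir` (directory names are placeholders):

```python
import os

import vos

client = vos.Client()
remote_dir = "vos:[user]/survey_data"

# Download only the files that are not already present locally
for name in client.listdir(remote_dir):
    if not os.path.exists(name):
        client.copy(f"{remote_dir}/{name}", f"./{name}")
```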
🔄 Inter-Storage Transfers¶
Moving Data Between Storage Systems¶
Scratch to ARC (Within Sessions)¶
# Process data in scratch for speed
cp /arc/projects/[project]/raw_data.fits /scratch/
python reduce_data.py /scratch/raw_data.fits
# Save results to permanent storage
cp /scratch/processed_data.fits /arc/projects/[project]/results/
cp -r /scratch/analysis_plots/ /arc/projects/[project]/figures/
ARC to Vault (Archival)¶
# Archive completed project results
vcp /arc/projects/[project]/final_results/ vos:[user]/archives/project2024/
Vault to ARC (Project Setup)¶
# Import archived data for new analysis
vcp vos:shared_project/calibrated_data/ /arc/projects/[project]/data/
# Import specific datasets
vcp "vos:public_surveys/gaia_dr3/*.fits" /arc/projects/[project]/catalogues/
Automated Workflow Example¶
#!/bin/bash
# Complete data processing workflow
set -e # Exit on error
PROJECT_DIR="/arc/projects/[project]"
SCRATCH_DIR="/scratch"
echo "Starting data processing pipeline..."
# 1. Download raw data from Vault to scratch
echo "Downloading raw data..."
vcp "vos:[user]/raw_observations/obs_*.fits" ${SCRATCH_DIR}/
# 2. Process data in scratch (fastest storage)
echo "Processing data..."
cd ${SCRATCH_DIR}
for file in obs_*.fits; do
    python calibrate.py "$file" "cal_${file}"
done
# 3. Save intermediate results to ARC
echo "Saving calibrated data..."
mkdir -p ${PROJECT_DIR}/calibrated/
cp cal_*.fits ${PROJECT_DIR}/calibrated/
# 4. Further analysis
echo "Running analysis..."
python analyze_all.py ${PROJECT_DIR}/calibrated/ > analysis_results.txt
# 5. Save final results to ARC and archive to Vault
echo "Saving final results..."
cp analysis_results.txt ${PROJECT_DIR}/results/
cp final_plots/*.png ${PROJECT_DIR}/figures/
# Archive to Vault
vcp ${PROJECT_DIR}/results/ vos:[user]/completed_projects/$(date +%Y%m%d)/
echo "Pipeline completed successfully!"
📊 Performance Optimisation¶
Transfer Speed Optimisation¶
For Many Small Files¶
# Bundle small files into archives
tar -czf analysis_scripts.tar.gz scripts/
vcp analysis_scripts.tar.gz vos:[user]/code/
# Use directory sync instead of individual copies
vsync --nstreams=4 ./many_small_files/ vos:[user]/collection/
Network Performance Tips¶
Optimal Transfer Times¶
- Best performance: Off-peak hours (evenings, weekends)
- Avoid: Peak research hours (9 AM - 5 PM Pacific)
Connection Optimisation¶
# Check network speed to CANFAR
ping ws-uv.canfar.net
# Test transfer speed with small file
time vcp test_file.fits vos:[user]/speed_test/
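To turn that timing into a throughput figure, upload a file of known size; a minimal sketch (the 100 MB size is arbitrary):

```bash
# Create a 100 MB test file and time its upload
dd if=/dev/zero of=test_file.fits bs=1M count=100
time vcp test_file.fits vos:[user]/speed_test/
# Throughput in MB/s ≈ 100 divided by the elapsed seconds reported by time
```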
🚨 Error Handling and Recovery¶
Common Transfer Issues¶
Authentication Errors¶
# Certificate expired
ERROR:: Expired cert. Update by running cadc-get-cert
# Solution: Refresh certificate
cadc-get-cert -u [user]
# Request a certificate with a longer validity period (in days)
cadc-get-cert -u [user] --days-valid 30
Network Timeouts¶
# Retry with exponential backoff
for i in {1..3}; do
    vcp file.fits vos:[user]/ && break
    sleep $((2**i))
done
Robust Transfer Script¶
#!/usr/bin/env python
"""
Robust file transfer with retry logic
"""
import sys
import time

import vos


def robust_transfer(source, destination, max_retries=3):
    """Transfer file with retry logic"""
    client = vos.Client()
    for attempt in range(max_retries):
        try:
            print(f"Transfer attempt {attempt + 1}: {source} → {destination}")
            client.copy(source, destination)
            print("✓ Transfer successful")
            return True
        except Exception as e:
            print(f"✗ Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Waiting {wait_time} seconds before retry...")
                time.sleep(wait_time)
            else:
                print(f"Transfer failed after {max_retries} attempts")
    return False


# Usage
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python robust_transfer.py <source> <destination>")
        sys.exit(1)
    source, destination = sys.argv[1], sys.argv[2]
    success = robust_transfer(source, destination)
    sys.exit(0 if success else 1)
📋 Transfer Checklists¶
Pre-Transfer Checklist¶
- Authentication: Valid CADC certificate (`cadc-get-cert`)
- Permissions: Write access to destination directory
- Space: Sufficient quota in destination storage
- Network: Stable connection for large transfers
- Backup: Important data backed up before moving
Post-Transfer Verification¶
# Verify file integrity
vls -l vos:[user]/transferred_file.fits # Check size and timestamp
# Compare checksums (if available)
vcp --head vos:[user]/data.fits | grep MD5
# Test file readability
python -c "from astropy.io import fits; fits.open('test_file.fits')"
Transfer Planning Template¶
## Transfer Plan: [Project Name]
**Data Description**:
- Size: ___GB
- File count: ___
- Type: Raw/Processed/Results
**Source**: _______________
**Destination**: ___________
**Method**: _______________
**Timeline**:
- Start: ____________
- Estimated completion: ___________
**Verification**:
- [ ] File count matches
- [ ] Total size matches
- [ ] Sample files readable
- [ ] Permissions set correctly
**Backup**: _______________
🔗 Integration Examples¶
Jupyter Notebook Upload¶
Within a CANFAR Jupyter session:
# Upload files using the Jupyter interface
# 1. Click the "Upload" button in file browser
# 2. Select files from your computer
# 3. Files appear in current directory
# Move uploaded files to appropriate storage
import shutil
shutil.move('uploaded_data.fits', '/arc/projects/[project]/data/')
# Or copy to scratch for processing
shutil.copy('/arc/projects/[project]/data.fits', '/scratch/')
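Inside a CANFAR session `/arc` is mounted directly, so `shutil` is enough for ARC storage; pushing the same file to Vault goes through the vos client instead. A minimal sketch:

```python
import vos

client = vos.Client()

# Copy a file from the session filesystem into Vault
client.copy("/arc/projects/[project]/data/uploaded_data.fits",
            "vos:[user]/backups/uploaded_data.fits")
```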
Batch Job Data Staging¶
#!/bin/bash
# Batch job with data staging
# Download input data
vcp vos:[project]/input_data.tar.gz /scratch/
cd /scratch
tar -xzf input_data.tar.gz
# Process data
python analysis.py input_data/
# Upload results
tar -czf results_$(date +%Y%m%d).tar.gz results/
vcp results_*.tar.gz vos:[user]/job_outputs/
# Cleanup
rm -rf /scratch/*
External Data Import¶
# Download from astronomical archives
wget -O survey_data.fits "https://archive.eso.org/..."
# Upload to CANFAR
vcp survey_data.fits vos:[user]/external_data/
# Or direct to project space
curl -E ~/.ssl/cadcproxy.pem \
-T survey_data.fits \
https://ws-uv.canfar.net/arc/files/projects/[project]/survey_data.fits