
# Chunked eyerisdb Database Export for Large Datasets

Source: vignettes/chunked-database-export.Rmd

## Introduction
When working with large eyeris databases containing millions of
eye-tracking data points, traditional export methods can run into
memory limitations or create unwieldy files. The chunked database
export functionality in eyeris provides an out-of-the-box solution for
handling very large eyerisdb databases by:

- Processing data in chunks to avoid memory issues
- Automatically splitting large files based on size limits
- Grouping tables by column structure to prevent SQL errors
- Supporting both CSV and Parquet formats for optimal performance
This vignette walks through how to use these features after you’ve
created an eyerisdb database using bidsify(db_enabled = TRUE).

## Prerequisites

Before using the chunked export functions, you need:

- An eyerisdb database created with bidsify(db_enabled = TRUE)
- The arrow package installed (for Parquet support):
  install.packages("arrow") (arrow is included when installing eyeris
  from CRAN)
- Sufficient disk space for the exported files
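Because Parquet support depends on the optional arrow package, it can be convenient to check availability up front and fall back to CSV. This is a plain base-R sketch; `file_format` here is just a local variable you would then pass to the export call, not an eyeris setting:

```r
# fall back to CSV when the optional arrow package is unavailable
has_arrow <- requireNamespace("arrow", quietly = TRUE)
file_format <- if (has_arrow) "parquet" else "csv"
file_format
```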
## Basic Usage

### Simple Export with Default Settings

The easiest way to export your entire database is with
eyeris_db_to_chunked_files():

```r
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project" # your database name
)

# view what was exported
print(result)
```

Using the eyeris_db_to_chunked_files() defaults, this will:

- Process 1 million rows at a time (the default chunk size)
- Create files up to 500 MB each (the default max file size)
- Export all data types found in your database
- Save files to bids_dir/derivatives/eyerisdb_export/my-proj/
### Understanding the Output Structure

The function creates organized output files:

```
derivatives/eyerisdb_export/my-proj/
├── my-proj_timeseries_chunked_01.csv              # Single file (< 500 MB)
├── my-proj_events_chunked_01-of-02.csv            # Multiple files due to size
├── my-proj_events_chunked_02-of-02.csv
├── my-proj_confounds_summary_goal_chunked_01.csv  # Grouped by schema
├── my-proj_confounds_summary_stim_chunked_01.csv  # Different column structure
├── my-proj_confounds_events_chunked_01.csv
├── my-proj_epoch_summary_chunked_01.csv
└── my-proj_epochs_pregoal_chunked_01-of-03.csv    # Epoch-specific data
```
## Advanced Configuration

### Controlling File Sizes

You can customize the maximum file size to create smaller, more
manageable files:

```r
# create smaller files for easy distribution
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  max_file_size_mb = 100, # 100 MB files instead of 500 MB
  chunk_size = 500000     # process 500k rows at a time
)
```

This is particularly useful when:

- Uploading to cloud storage with size or bandwidth limits
- Sharing data via email or file transfer services
- Working with limited storage space
### Exporting Specific Data Types

For large databases, you may only need certain types of data:

```r
# export only pupil timeseries and events
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  data_types = c("timeseries", "events"),
  subjects = c("sub-001", "sub-002", "sub-003") # specific subjects only
)
```

Available data types typically include:

- timeseries - Preprocessed eye-tracking pupil data
- events - Experimental events
- epochs - Epoched data around events
- confounds_summary - Confound variables by epoch
- blinks - Detected blinks
### Using Parquet Format

For better performance and compression, use Parquet format:

```r
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  file_format = "parquet",
  max_file_size_mb = 200
)
```

Parquet advantages:

- Smaller file sizes (often 50-80% smaller than CSV)
- Faster reading with arrow::read_parquet()
- Better data types (preserves numeric precision)
- Column-oriented storage for analytics
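To see the size difference on your own machine, the comparison can be sketched with a small self-contained example (random placeholder data, not an eyeris export) that writes the same data frame both ways:

```r
# write the same data frame as CSV and Parquet and compare on-disk sizes
# (random placeholder data; not an eyeris export)
df <- data.frame(
  time_secs = seq_len(100000) / 1000,
  pupil_raw = rnorm(100000, mean = 3000, sd = 150)
)

csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

if (requireNamespace("arrow", quietly = TRUE)) {
  pq_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df, pq_path)
  # a ratio below 1 means Parquet is smaller on disk
  file.size(pq_path) / file.size(csv_path)
}
```

The exact ratio depends on your data; repetitive columns (subject IDs, condition labels) compress far better than random noise.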
## Working with the Exported Files

### Reading Single Files Back into R

```r
# read a single CSV file
data <- read.csv("path/to/timeseries_chunked.csv")

# read a single Parquet file (requires the arrow package)
if (requireNamespace("arrow", quietly = TRUE)) {
  data <- arrow::read_parquet("path/to/timeseries_chunked.parquet")
}
```

### Combining Multiple Split Files
When files are split due to size limits, you can recombine them:

```r
# find all parts of a split dataset
files <- list.files(
  "path/to/eyerisdb_export/my-project/",
  pattern = "timeseries_chunked_.*\\.csv$",
  full.names = TRUE
)

# read and combine all parts
combined_data <- do.call(rbind, lapply(files, read.csv))

# or use the built-in helper function
combined_data <- read_eyeris_parquet(
  parquet_dir = "path/to/eyerisdb_export/my-project/",
  data_type = "timeseries"
)
```

## Advanced Use Cases
### Custom Chunk Processing

For specialized analysis, you can process chunks with custom functions:

```r
# connect to the database directly
con <- eyeris_db_connect("/path/to/bids", "large-project")

# define a custom analysis function for pupil data
analyze_chunk <- function(chunk) {
  # calculate summary statistics for this chunk
  stats <- data.frame(
    n_rows = nrow(chunk),
    subjects = length(unique(chunk$subject_id)),
    mean_eye_x = mean(chunk$eye_x, na.rm = TRUE),
    mean_eye_y = mean(chunk$eye_y, na.rm = TRUE),
    mean_pupil_raw = mean(chunk$pupil_raw, na.rm = TRUE),
    mean_pupil_processed = mean(
      chunk$pupil_raw_deblink_detransient_interpolate_lpfilt_z,
      na.rm = TRUE
    ),
    missing_pupil_pct = sum(is.na(chunk$pupil_raw)) / nrow(chunk) * 100,
    hz_modes = paste(unique(chunk$hz), collapse = ",")
  )

  # append this chunk's summary to a growing file
  # (write.csv() ignores `append`, so use write.table() instead)
  first_chunk <- !file.exists("chunk_summaries.csv")
  write.table(
    stats, "chunk_summaries.csv",
    append = !first_chunk, col.names = first_chunk,
    sep = ",", row.names = FALSE
  )

  return(TRUE) # indicate success
}

# hypothetical example: process a large timeseries dataset in chunks
result <- process_chunked_query(
  con = con,
  query = "
    SELECT subject_id, session_id, time_secs, eye_x, eye_y,
           pupil_raw, pupil_raw_deblink_detransient_interpolate_lpfilt_z, hz
    FROM timeseries_01_enc_clamp_run01
    WHERE pupil_raw > 0 AND eye_x IS NOT NULL
    ORDER BY time_secs
  ",
  chunk_size = 100000,
  process_chunk = analyze_chunk
)

eyeris_db_disconnect(con)
```

### Handling Very Large Databases
For databases with hundreds of millions of rows:

```r
# optimize for very large datasets
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "massive-project",
  chunk_size = 2000000,     # 2M rows per chunk for efficiency
  max_file_size_mb = 1000,  # 1 GB files (larger but fewer files)
  file_format = "parquet",  # better compression
  data_types = "timeseries" # focus on the primary data type
)
```

## Performance Tips
### Optimizing Chunk Size

- Smaller chunks (100k-500k rows): Use when memory is limited
- Medium chunks (1M rows): Good default for most systems
- Larger chunks (2M+ rows): Use for very large datasets with ample memory
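One way to turn these rules of thumb into a concrete number is to estimate rows per chunk from a memory budget. This is a standalone back-of-the-envelope sketch (not an eyeris function), assuming roughly 8 bytes per numeric value plus a safety factor for R's copying overhead:

```r
# rough chunk-size estimate from a memory budget (a sketch, not an
# eyeris function): bytes per row ~= n_numeric_cols * 8, inflated by a
# safety factor for R's copying and parsing overhead
estimate_chunk_size <- function(budget_mb, n_cols, overhead_factor = 4) {
  bytes_per_row <- n_cols * 8 * overhead_factor
  max(1e5, floor(budget_mb * 1024^2 / bytes_per_row))
}

# e.g., a 500 MB working budget with 8 numeric columns
estimate_chunk_size(500, 8)
```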
## Troubleshooting

### Memory Issues

If you encounter out-of-memory errors:

```r
# reduce the chunk size
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "project",
  chunk_size = 250000, # smaller chunks
  verbose = TRUE       # monitor progress
)
```

### SQL Query Length Errors
The function automatically handles this by processing tables in batches, but if you encounter issues:
- The function processes a maximum of 50 tables per SQL query
- Tables are automatically grouped by compatible column structures
- Complex epoch data is separated into schema-compatible groups
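The batching idea behind this behavior can be sketched in base R (the table names below are hypothetical; this is not eyeris internals):

```r
# split a vector of table names into batches of at most 50, so each
# combined query stays under the per-query table limit
batch_tables <- function(tables, batch_size = 50) {
  split(tables, ceiling(seq_along(tables) / batch_size))
}

tables <- paste0("timeseries_", sprintf("%03d", 1:120)) # hypothetical names
batches <- batch_tables(tables)
lengths(batches) # batch sizes: 50, 50, 20
```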
### Column Structure Mismatches

When you see “Set operations can only apply to expressions with the same number of result columns”:
- The function automatically detects and groups tables by column structure
- Tables with different schemas (e.g., different epoch types) are exported separately
- No manual intervention required - this is handled automatically
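The grouping step can be illustrated with a small base-R sketch: tables whose (sorted) column names match get the same key and can safely be combined with set operations. The schemas below are hypothetical, not eyeris internals:

```r
# group tables by identical column structure (hypothetical schemas)
schemas <- list(
  epochs_pregoal  = c("subject_id", "time_secs", "pupil_z"),
  epochs_postgoal = c("subject_id", "time_secs", "pupil_z"),
  events          = c("subject_id", "onset", "label")
)

# a key built from sorted column names identifies compatible tables
schema_key <- vapply(
  schemas,
  function(cols) paste(sort(cols), collapse = "|"),
  character(1)
)

# tables sharing a key are exported together; others separately
split(names(schemas), schema_key)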
## Getting Help

For additional help:

- Check the function documentation: ?eyeris_db_to_chunked_files
- View database contents: eyeris_db_summary(bids_dir, db_path)
- List available tables: eyeris_db_list_tables(con)
- Monitor progress with verbose = TRUE
## Summary

The built-in chunked eyerisdb database export functionality provides a
robust solution for working with large eyerisdb databases. Key benefits
include:

- Scalability: Handle databases of any size without memory issues
- Flexibility: Choose the file formats, sizes, and data types to export
- Reliability: Automatic error handling for common database issues
- Efficiency: Optimized chunk processing with progress tracking

Together, these features make it possible to work with even the largest
eye-tracking and pupillometry datasets while maintaining performance
and reliability, without sacrificing the ability to share high-quality,
reproducible datasets that support collaborative and open research.