Project Overview
anndataR is a Bioconductor R package that provides native R support for reading/writing .h5ad (AnnData) files and bidirectional conversion between AnnData objects and popular Bioconductor/Seurat formats.
Key Features:
- Native R reading/writing of .h5ad files (HDF5 backend)
- Three AnnData implementations: InMemoryAnnData, HDF5AnnData, ReticulateAnnData
- Bidirectional conversion: SingleCellExperiment ↔︎ AnnData ↔︎ Seurat
- S4 coercion system for seamless type conversion
- Python interoperability via reticulate
Architecture
R6 Class Hierarchy
AbstractAnnData (abstract base class)
├── InMemoryAnnData (all data in RAM)
├── HDF5AnnData (HDF5-backed, lazy loading)
├── ReticulateAnnData (Python anndata wrapper via reticulate)
└── AnnDataView (lazy subsetting view, no data copying)
When to use each:
- InMemoryAnnData: Default for most operations, fast for small/medium datasets
- HDF5AnnData: Large datasets exceeding RAM, on-disk persistence
- ReticulateAnnData: Interoperability with Python anndata library, testing against Python behavior
- AnnDataView: Lazy subsetting/slicing without copying data; convert to concrete implementation when needed
Core Slots (AnnData spec)
- X: Main matrix (n_obs × n_vars)
- layers: Named list of alternative matrices with same dimensions as X
- obs: DataFrame with observation (cell) metadata (n_obs rows)
- var: DataFrame with variable (gene) metadata (n_vars rows)
- obsm: List of observation-aligned matrices (e.g., PCA, UMAP embeddings)
- varm: List of variable-aligned matrices
- obsp: List of observation pairwise matrices (e.g., cell-cell distances)
- varp: List of variable pairwise matrices
- uns: Unstructured metadata (arbitrary nested lists/dicts)
Important validation rules:
- All matrices in X/layers must have the same shape: [n_obs, n_vars]
- Row names of obs/var define observation/variable names
- obsm/varm matrices must align: first dimension matches n_obs/n_vars
- obsp/varp must be square pairwise matrices
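These rules reduce to a handful of dimension checks. The sketch below uses plain base R; check_shapes() and its arguments are hypothetical names chosen for illustration and are not the package's actual validators.

```r
# Hypothetical helper: returns a character vector of rule violations
# (empty when every matrix has a valid shape).
check_shapes <- function(X, layers = list(), obsm = list(), obsp = list()) {
  n_obs <- nrow(X)
  n_vars <- ncol(X)
  problems <- character(0)

  # Every layer must have exactly the same shape as X.
  for (name in names(layers)) {
    if (!identical(dim(layers[[name]]), dim(X))) {
      problems <- c(problems, sprintf("layer '%s' is not %d x %d", name, n_obs, n_vars))
    }
  }
  # obsm entries only need their first dimension to match n_obs.
  for (name in names(obsm)) {
    if (nrow(obsm[[name]]) != n_obs) {
      problems <- c(problems, sprintf("obsm '%s' does not have %d rows", name, n_obs))
    }
  }
  # obsp entries must be square with side n_obs.
  for (name in names(obsp)) {
    if (!identical(dim(obsp[[name]]), c(n_obs, n_obs))) {
      problems <- c(problems, sprintf("obsp '%s' is not %d x %d", name, n_obs, n_obs))
    }
  }
  problems
}

X <- matrix(0, nrow = 3, ncol = 5)
ok <- check_shapes(
  X,
  layers = list(counts = matrix(1, 3, 5)),
  obsm = list(pca = matrix(0, 3, 2))
)
bad <- check_shapes(X, obsp = list(distances = matrix(0, 3, 4)))
```

Here ok is empty, while bad contains one message about the non-square distances matrix. The same checks apply to varm/varp with n_vars in place of n_obs.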
Critical Implementation Details
1. obs_names/var_names Architecture
obs_names and var_names are now stored separately from obs/var data.frames:
# InMemoryAnnData and HDF5AnnData store names separately
private$.obs_names # Character vector of observation names
private$.var_names # Character vector of variable names
# obs/var data.frames are stored WITHOUT rownames internally
private$.obs # Data.frame with NULL rownames
private$.var # Data.frame with NULL rownames
# Dimnames are added ON-THE-FLY when users access data
ad$obs # Returns data.frame with rownames = obs_names
ad$X # Returns matrix with dimnames from obs_names/var_names
Why this matters:
- All matrix data (X, layers, obsm, varm, obsp, varp) is stored internally without dimnames
- Dimnames are dynamically added via .add_matrix_dimnames() and .add_obsvar_dimnames() helper methods
- This ensures consistency with Python anndata where obs/var names are separate from the data
- When setting obs/var, extract rownames first if present, then strip them
Pattern for setters:
# Extract names before validation
if (!is.null(value) && has_row_names(value)) {
private$.obs_names <- rownames(value)
rownames(value) <- NULL
}
private$.obs <- private$.validate_obsvar_dataframe(value, "obs")
Use obs_names/var_names properties directly:
- Don’t use rownames(ad$obs) or colnames(ad$X) in internal code
- Use ad$obs_names and ad$var_names for reliable access to names
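The storage scheme can be illustrated without R6. The following is a minimal base-R sketch, assuming a closure-based container; new_obs_store() and its accessors are hypothetical and only mimic the idea of keeping names apart from the data.frame.

```r
# Toy container (hypothetical, not the package's R6 class): names live in
# their own vector and the data.frame is stored without row names.
new_obs_store <- function() {
  obs_names <- character(0)
  obs <- data.frame()

  set_obs <- function(value) {
    # Extract the names first, then strip them from the stored data.frame.
    # .row_names_info() returns a negative count for automatic row names.
    if (.row_names_info(value) > 0L) {
      obs_names <<- rownames(value)
      rownames(value) <- NULL
    }
    obs <<- value
    invisible(NULL)
  }
  get_obs <- function() {
    # Row names are attached on the fly, only on the copy handed to the caller.
    out <- obs
    if (length(obs_names) == nrow(out)) {
      rownames(out) <- obs_names
    }
    out
  }
  list(
    set_obs = set_obs,
    get_obs = get_obs,
    obs_names = function() obs_names,
    raw = function() obs
  )
}

store <- new_obs_store()
store$set_obs(data.frame(row.names = c("A", "B", "C"), cell_type = c("T", "B", "T")))
store$obs_names()          # names come from the dedicated vector
rownames(store$raw())      # the stored copy only has automatic row names
rownames(store$get_obs())  # the user-facing copy carries the names again
```

Reading names from the dedicated accessor avoids depending on whatever dimnames happen to be attached to a returned copy.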
2. AnnDataView for Lazy Subsetting
AnnDataView provides lazy subsetting without data copying:
# Create a view with S3 [ operator
ad <- AnnData(
X = matrix(1:15, 3L, 5L),
obs = data.frame(row.names = LETTERS[1:3], cell_type = c("A", "B", "A"))
)
view <- ad[ad$obs$cell_type == "A", ] # Returns AnnDataView, no data copied
# Convert to concrete implementation when needed
result <- view$as_InMemoryAnnData() # Now subsetting is applied
Key characteristics:
- Inherits from AbstractAnnData
- Stores base AnnData object and subset indices
- All getters apply subsetting on-the-fly
- Setters are disabled - must convert to concrete implementation first
- Use .apply_subset() helper for matrix/data.frame subsetting
- Use .apply_vector_subset() for vector subsetting (obs_names, var_names)
Testing pattern: See tests/testthat/test-AnnDataView.R for usage examples
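The idea of storing a base object plus indices and subsetting only on access can be sketched in a few lines of base R. make_view() and the list layout below are assumptions made for illustration; they do not reproduce the AnnDataView class.

```r
# Toy lazy view (hypothetical structure): keep a reference to the base data
# plus row/column indices, and subset only when a getter is called.
make_view <- function(base, i = seq_len(nrow(base$X)), j = seq_len(ncol(base$X))) {
  list(
    X = function() base$X[i, j, drop = FALSE],
    obs_names = function() base$obs_names[i],
    var_names = function() base$var_names[j],
    # Materialise: apply the subset once and return a concrete object.
    as_concrete = function() {
      list(
        X = base$X[i, j, drop = FALSE],
        obs_names = base$obs_names[i],
        var_names = base$var_names[j]
      )
    }
  )
}

base <- list(
  X = matrix(1:15, 3L, 5L),
  obs_names = c("A", "B", "C"),
  var_names = letters[1:5]
)
view <- make_view(base, i = c(TRUE, FALSE, TRUE))
view$obs_names()
dim(view$X())
concrete <- view$as_concrete()
```

The view exposes getters only, which mirrors the rule that setters are disabled until the view is converted to a concrete object.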
3. S3 Methods for AbstractAnnData
Standard R S3 methods now work on all AnnData objects:
# Dimension methods
dim(ad) # [n_obs, n_vars]
nrow(ad) # n_obs
ncol(ad) # n_vars
dimnames(ad) # list(obs_names, var_names)
rownames(ad) # obs_names
colnames(ad) # var_names
# Subsetting with [ operator
ad[1:5, ] # Subset observations, returns AnnDataView
ad[, 1:10] # Subset variables, returns AnnDataView
ad[1:5, 1:10] # Subset both, returns AnnDataView
Implementation: See R/AbstractAnnData-s3methods.R for all S3 methods
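For readers unfamiliar with how these accessors dispatch, the sketch below defines dim() and dimnames() methods for a toy S3 class. The class name toy_anndata is hypothetical; the point is only that base R's nrow(), ncol(), rownames() and colnames() are built on those two generics.

```r
# Toy class (hypothetical) holding names and a matrix without dimnames.
toy <- structure(
  list(
    X = matrix(1:15, 3L, 5L),
    obs_names = c("A", "B", "C"),
    var_names = letters[1:5]
  ),
  class = "toy_anndata"
)

dim.toy_anndata <- function(x) dim(x$X)
dimnames.toy_anndata <- function(x) list(x$obs_names, x$var_names)

# nrow()/ncol() call dim(), and rownames()/colnames() call dimnames(),
# so these two methods are enough for all six accessors.
dim(toy)
nrow(toy)
ncol(toy)
rownames(toy)
colnames(toy)
```

Whether the package defines the remaining accessors explicitly or relies on this fallback is not stated here; R/AbstractAnnData-s3methods.R is the authoritative source.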
4. HDF5 File Management
Pattern: HDF5AnnData manages the lifecycle of its file handle:
# Automatic closure on finalization
adata <- read_h5ad("file.h5ad") # close_on_finalize = TRUE
# Manual closure required for explicit construction
adata <- HDF5AnnData$new("file.h5ad")
adata$close() # Must call explicitly
# Check validity before operations
private$.check_file_valid() # Throws if handle closed
Testing pattern: test-h5ad-fileclosure.R validates handles close properly.
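The lifecycle can be illustrated with an ordinary file connection instead of an HDF5 handle. This is a sketch under that assumption; open_handle(), close_handle() and check_handle_valid() are hypothetical names, and reg.finalizer() stands in for whatever mechanism the package uses for close_on_finalize.

```r
# Toy handle wrapper (hypothetical): an environment owning a file connection.
open_handle <- function(path, close_on_finalize = FALSE) {
  h <- new.env()
  h$con <- file(path, open = "rb")
  h$closed <- FALSE
  if (close_on_finalize) {
    # Runs when `h` is garbage collected (or at exit) if it is still open.
    reg.finalizer(h, function(e) if (!e$closed) close_handle(e), onexit = TRUE)
  }
  h
}

close_handle <- function(h) {
  close(h$con)
  h$closed <- TRUE
  invisible(h)
}

check_handle_valid <- function(h) {
  if (h$closed) stop("file handle is closed", call. = FALSE)
  invisible(TRUE)
}

path <- tempfile()
writeBin(as.raw(1:4), path)

h <- open_handle(path)
check_handle_valid(h)  # passes while the handle is open
close_handle(h)        # explicit closure
after_close <- tryCatch(check_handle_valid(h), error = function(e) conditionMessage(e))
```

Tracking the closed state in the object itself lets every operation fail early with a clear message instead of touching a dead handle.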
Testing Infrastructure
Conditional Test Skipping
Pattern: Helper functions check for optional dependencies:
# tests/testthat/helper-skip_if_no_anndata.R
skip_if_no_anndata <- function() {
if (!rlang::is_installed("reticulate")) {
skip("reticulate not installed")
}
# Check Python anndata module
if (!reticulate::py_module_available("anndata")) {
skip("Python anndata not available")
}
}
# Usage in tests
test_that("ReticulateAnnData works", {
skip_if_no_anndata()
# ... test code ...
})
Available skip helpers:
- skip_if_no_anndata(): Python anndata + reticulate
- skip_if_no_dummy_anndata(): anndataR.testdata Python module (for roundtrip tests)
- skip_if_no_h5diff(): h5diff CLI tool
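A skip helper for a command-line tool can follow the same shape as the reticulate check. The sketch below is an assumption about how such a helper might look, not the package's implementation; has_cli_tool() and skip_if_no_cli_tool() are hypothetical names.

```r
# Hypothetical predicate: TRUE when a command-line tool is on the PATH.
# Sys.which() returns "" for tools it cannot find.
has_cli_tool <- function(tool) {
  nzchar(unname(Sys.which(tool)))
}

# Sketch of a skip helper built on it; assumes testthat is installed
# whenever the helper is actually called from a test.
skip_if_no_cli_tool <- function(tool) {
  if (!has_cli_tool(tool)) {
    testthat::skip(paste(tool, "not available"))
  }
  invisible(TRUE)
}

missing_tool <- has_cli_tool("surely-not-a-real-tool-12345")
```

Keeping the predicate separate from the skip call makes the availability check reusable outside of testthat.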
Roundtrip Testing Pattern
Goal: Ensure R read/write matches Python anndata behavior
# tests/testthat/test-roundtrip-X.R example
test_that("Dense X roundtrips correctly", {
skip_if_no_dummy_anndata()
# Generate test data with known structure
adata <- generate_dataset(
X_type = "dense",
obs = data.frame(row.names = LETTERS[1:10]),
var = data.frame(row.names = letters[1:5])
)
# Write R → H5AD
tmp <- tempfile(fileext = ".h5ad")
adata$write_h5ad(tmp)
# Read back and verify
adata2 <- read_h5ad(tmp)
expect_equal(adata$X, adata2$X)
# Compare with Python via dummy_anndata
py <- reticulate::import("anndataR.testdata")
py_adata <- py$dummy_anndata(X = "dense")
expect_equal_py(adata, py_adata) # Custom helper
})
Custom helpers:
- expect_equal_py(r_adata, py_adata): Compare R and Python AnnData objects
- generate_dataset(): Create test AnnData with configurable matrix types
Mock Data Generation
Usage:
# R/generate_dataset.R
adata <- generate_dataset(
X_type = "dense", # "dense", "sparse", "csparse", "rsparse"
obs = data.frame(row.names = LETTERS[1:100]),
var = data.frame(row.names = letters[1:50]),
n_layers = 2,
obs_has_row_names = TRUE # Toggle for testing edge cases
)
Conversion System
From External Formats
Pattern: from_*() functions for explicit conversion:
# from_SingleCellExperiment.R
adata <- from_SingleCellExperiment(
sce,
output_class = "InMemoryAnnData", # or "HDF5AnnData"
X_name = "counts", # Which assay → X
layers = c("logcounts"), # Additional assays → layers
uns_keys = c("pca") # Which metadata → uns
)
# from_Seurat.R
adata <- from_Seurat(
seurat_obj,
output_class = "InMemoryAnnData",
assay = "RNA" # Which assay to extract
)
Guessing pattern: from_Seurat() uses helper functions to intelligently map:
- .from_Seurat_guess_layers(): Identify assay data slots
- .from_Seurat_guess_obsms(): Extract dimensionality reductions (PCA, UMAP)
- .from_Seurat_guess_uns(): Map miscellaneous metadata
To External Formats
Pattern: as_*() functions with optional parameters:
# as_SingleCellExperiment.R
sce <- as_SingleCellExperiment(
adata,
X_name = "counts", # What to call X assay
layer_names = NULL # Which layers to include (NULL = all)
)
# as_Seurat.R
seurat <- as_Seurat(
adata,
assay_name = "RNA" # Assay name in Seurat object
)
Development Workflow
Build and Test Commands
# Full R CMD check (Bioconductor standards)
R CMD build .
R CMD check --as-cran anndataR_*.tar.gz
# Quick test suite
R -e 'devtools::test()'
# Run specific test file
R -e 'devtools::test(filter = "roundtrip-X")'
# Lint checking
R -e 'lintr::lint_package()'
# Check code formatting
air format --check .
# Reformat code
air format .
Lintr Configuration
Key rules:
- Line length: 120 characters max
- Use paste() to wrap long strings in test helpers
- Prefer explicit :: for package functions in examples
Example fix:
# ❌ Too long
expect_warning(as(adata, "Seurat"), "Consider using as_Seurat() for more control over the conversion")
# ✅ Wrapped
expect_warning(
as(adata, "Seurat"),
paste(
"Consider using as_Seurat() for more control",
"over the conversion"
)
)
Dependency Management
Pattern: Check for optional packages before use:
# R/check_requires.R
check_requires <- function(package, reason = NULL) {
if (!rlang::is_installed(package)) {
msg <- c(
"!" = "Package {.pkg {package}} is required",
"i" = "Install with: {.code install.packages(\"{package}\")}"
)
if (!is.null(reason)) {
msg <- c(msg, "i" = reason)
}
cli::cli_abort(msg)
}
}
# Usage
as_Seurat <- function(adata, ...) {
check_requires("SeuratObject", "for converting to Seurat objects")
# ... conversion code ...
}
Common Pitfalls
1. obs_names/var_names Separation
Problem: Trying to access obs/var names via rownames instead of dedicated properties
Wrong:
# ❌ Internal code should not use this pattern
obs_ids <- rownames(adata$obs)
gene_ids <- colnames(adata$X)
Correct:
# ✅ Always use dedicated properties
obs_ids <- adata$obs_names
gene_ids <- adata$var_names
Why: obs_names/var_names are stored separately and added on-the-fly to user-facing data
2. Row Names Handling
Problem: R data.frames require unique row names, but AnnData allows duplicates
Solution: generate_dataframe.R has obs_has_row_names parameter for testing edge cases
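The R side of this constraint is easy to reproduce. The short base-R sketch below shows a data.frame rejecting duplicated row names while a separate character vector holds them without complaint; the variable names are illustrative only.

```r
obs <- data.frame(cell_type = c("T", "B", "T"))

# Assigning duplicated row names fails for data.frames...
result <- tryCatch(
  suppressWarnings({
    rownames(obs) <- c("cell1", "cell1", "cell2")
    "accepted"
  }),
  error = function(e) "rejected"
)

# ...while a plain character vector kept next to the data.frame can hold them.
obs_names <- c("cell1", "cell1", "cell2")
has_duplicates <- anyDuplicated(obs_names) > 0
```

Here result is "rejected" and has_duplicates is TRUE, which is one reason tests need a way to generate data.frames without row names.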
3. Sparse Matrix Compatibility
Problem: Different sparse matrix classes (Matrix::dgCMatrix vs Matrix::dgRMatrix)
Solution: Use generate_dataset(X_type = "csparse") for CSC, "rsparse" for CSR testing
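The two layouts can be produced directly with the Matrix package when a quick manual check is needed. This is a minimal sketch that assumes Matrix is installed; it does not use generate_dataset().

```r
library(Matrix)

dense <- matrix(c(1, 0, 0, 2, 0, 3), nrow = 2)

csc <- as(dense, "CsparseMatrix")  # compressed sparse column (CSC)
csr <- as(csc, "RsparseMatrix")    # compressed sparse row (CSR)

class(csc)
class(csr)

# Both layouts describe the same values.
same_values <- all(as.matrix(csc) == as.matrix(csr))
```

Code that inspects slots directly has to treat the two classes differently, since CSC objects store row indices in the i slot and CSR objects store column indices in the j slot.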
4. NULL vs Empty List in uns
Problem: Python distinguishes None from {}; R treats NULL differently
Solution: As of anndata 0.12.0, NULL is written as an empty HDF5 dataset. This behaviour is controlled by options(anndataR.write_null = TRUE)
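The R half of this mismatch comes from how lists treat NULL. The base-R snippet below shows the relevant behaviour; it is independent of anndataR and only illustrates why NULL entries in uns need explicit handling.

```r
uns <- list(a = 1, b = NULL)  # list() keeps an explicit NULL element
names_initial <- names(uns)

uns$a <- NULL                 # assigning NULL with $<- removes the element
names_after_removal <- names(uns)

uns["c"] <- list(NULL)        # single-bracket assignment stores a NULL value
names_after_insert <- names(uns)

empty_is_null <- is.null(list())  # an empty list is not NULL
```

After these steps the list holds b and c, both NULL, and empty_is_null is FALSE. Code that builds uns with $<- can therefore drop keys silently when a value happens to be NULL.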
5. Closure Variable Capture in Factories
Problem: Closures created in a loop share the loop variable, so they see its final value rather than the value from their own iteration
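The pitfall can be reproduced in plain R without setAs(). In this minimal sketch, classes, broken, fixed and make_handler() are illustrative names and not part of anndataR.

```r
classes <- c("a", "b", "c")

# Closures created in a loop all look up the same variable, so after the
# loop every handler reports the last value.
broken <- list()
for (cls in classes) {
  broken[[cls]] <- function() cls
}

# A factory with force() evaluates its argument immediately, so each handler
# keeps the value from its own iteration.
make_handler <- function(cls) {
  force(cls)
  function() cls
}
fixed <- list()
for (cls in classes) {
  fixed[[cls]] <- make_handler(cls)
}

broken_result <- broken$a()
fixed_result <- fixed$a()
```

Here broken_result is "c" while fixed_result is "a". The package-level version of the same pattern follows.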
# ❌ WRONG - all handlers reference final iteration value
for (class in classes) {
setAs(class, "Seurat", function(from) convert(from, class))
}
# ✅ CORRECT - force() captures variable value
.make_convert_handler <- function(convert_fn, from_str, to_str) {
force(convert_fn)
force(from_str)
force(to_str)
function(from) convert_fn(from)
}
File Organization
Source Files by Category
Core Classes:
- R/AbstractAnnData.R: Base class with abstract slots and validation
- R/AbstractAnnData-s3methods.R: S3 methods (dim, nrow, ncol, dimnames, [)
- R/InMemoryAnnData.R: RAM-based implementation
- R/HDF5AnnData.R: HDF5-backed implementation
- R/ReticulateAnnData.R: Python wrapper (experimental)
- R/AnnDataView.R: Lazy view for subsetting without data copying
I/O:
- R/read_h5ad.R, R/read_h5ad_helpers.R: Reading HDF5 files
- R/write_h5ad.R, R/write_h5ad_helpers.R: Writing HDF5 files
- R/write_hdf5_helpers.R: Low-level HDF5 utilities
Conversion:
- R/as-coercions.R: S4 coercion registration
- R/as_AnnData.R: Generic converter to AnnData
- R/as_SingleCellExperiment.R: AnnData → SCE
- R/as_Seurat.R: AnnData → Seurat
- R/from_SingleCellExperiment.R: SCE → AnnData
- R/from_Seurat.R: Seurat → AnnData
Testing:
- R/generate_dataset.R, R/generate_*.R: Mock data generation
- tests/testthat/helper-*.R: Test utilities and skip helpers
- tests/testthat/test-roundtrip-*.R: Python compatibility tests
- tests/testthat/test-as-*.R: Conversion validation
- tests/testthat/test-AnnDataView.R: Lazy subsetting tests
- tests/testthat/test-AbstractAnnData-s3methods.R: S3 method tests
Utilities:
- R/check_requires.R: Dependency validation
- R/ui.R: CLI messaging helpers
- R/utils.R: Miscellaneous helpers
- R/known_issues.R: Track known bugs/limitations
Documentation Standards
Roxygen2 Patterns
Class documentation:
#' @title InMemoryAnnData
#' @description Implementation of an in-memory AnnData object.
#' @seealso [AnnData-usage] for details on creating and using AnnData objects
#' @family AnnData classes
#' @examples
#' adata <- AnnData(X = matrix(1:15, 3L, 5L), ...)
Cross-references:
- Use [AnnData-usage] for user-facing documentation
- Link related functions: @seealso [read_h5ad()], [write_h5ad()]
CI/CD Considerations
Bioconductor requirements:
- Must pass R CMD check --as-cran with no errors/warnings
- BiocCheck compliance
- All examples must run successfully
- Conditional package usage (see check_requires pattern)
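Conditional package usage can also be sketched without rlang or cli. The base-R variant below is an illustration under that assumption; require_package() is a hypothetical name and is not a replacement for check_requires().

```r
# Base-R variant (hypothetical name) of the dependency check: stop with an
# informative message when an optional package is missing.
require_package <- function(package, reason = NULL) {
  if (!requireNamespace(package, quietly = TRUE)) {
    msg <- sprintf("Package '%s' is required", package)
    if (!is.null(reason)) {
      msg <- paste(msg, reason)
    }
    stop(msg, call. = FALSE)
  }
  invisible(TRUE)
}

# An installed package passes silently.
require_package("stats")

# A missing package produces an error that names the package and the reason.
outcome <- tryCatch(
  require_package("notARealPackage12345", "for this example"),
  error = function(e) conditionMessage(e)
)
```

Calling the check at the top of a function keeps examples and tests from failing with an obscure error deep inside the conversion code.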
Quick Reference
Creating AnnData Objects
# In-memory (default)
adata <- AnnData(
X = matrix(1:15, 3L, 5L),
obs = data.frame(row.names = LETTERS[1:3]),
var = data.frame(row.names = letters[1:5])
)
# HDF5-backed
adata <- HDF5AnnData$new("path.h5ad", mode = "w")
adata$X <- matrix(1:15, 3L, 5L)
# ... set other slots ...
# From file
adata <- read_h5ad("file.h5ad")
Conversion Examples
# SingleCellExperiment → AnnData
adata <- from_SingleCellExperiment(sce, X_name = "counts")
# AnnData → SingleCellExperiment
sce <- as_SingleCellExperiment(adata)
# Seurat → AnnData
adata <- from_Seurat(seurat_obj)
# AnnData → Seurat
seurat <- as_Seurat(adata)
# S4 coercion (if registered)
sce <- as(adata, "SingleCellExperiment")
Accessing Data
# Dimensions (using S3 methods)
dim(adata) # [n_obs, n_vars]
nrow(adata) # n_obs
ncol(adata) # n_vars
dimnames(adata) # list(obs_names, var_names)
rownames(adata) # obs_names
colnames(adata) # var_names
# Or using R6 methods
adata$n_obs() # Number of observations
adata$n_vars() # Number of variables
# Main matrix
adata$X # Read
adata$X <- mat # Write
# Metadata
adata$obs # Observation metadata
adata$var # Variable metadata
adata$uns # Unstructured metadata
# Additional matrices
adata$layers[["raw"]] # Named layer
adata$obsm[["X_pca"]] # Observation matrix (PCA)
adata$obsp[["distances"]] # Pairwise matrix
Subsetting with AnnDataView
# Subsetting returns AnnDataView (lazy, no data copied)
view <- adata[1:10, ] # Subset observations
view <- adata[, c("gene1", "gene2")] # Subset variables
view <- adata[adata$obs$cell_type == "T", ] # Conditional subsetting
# Convert to concrete implementation to apply changes
result <- view$as_InMemoryAnnData()
result <- view$as_HDF5AnnData("output.h5ad")
Questions to Ask When Contributing
- Does this need Python compatibility? → Add roundtrip test
- Requires optional package? → Use check_requires() + conditional tests
- Modifying validation? → Check AbstractAnnData validators
- New matrix type? → Add to generate_dataset() for testing
- Changing uns handling? → Consider Python dict requirements
- HDF5 changes? → Verify with h5diff against Python output
- New conversion feature? → Update both from_*() and as_*() paths
Additional Resources
- AnnData spec: Official Python documentation
- vignettes/software_design.Rmd: Detailed architecture diagrams
- inst/known_issues.yaml: Tracked bugs and workarounds
- Bioconductor submission guidelines