ADR-003: Plugin Architecture for Data Ingestion
Status: Proposed
Date: 2025-04-23
Decision makers: Jarrod Barnes (Founder), Core Eng Team
Context: Arc Memory needs to support multiple data sources beyond Git, GitHub, and ADRs. This ADR outlines the plugin architecture that will enable extensible data ingestion.
1 · Problem Statement
Arc Memory currently has hardcoded ingestors for Git, GitHub, and ADRs. As we expand to support additional data sources like Notion, Jira, Linear, and G-Suite, we need a flexible architecture that:
- Allows adding new data sources without modifying core code
- Maintains a consistent interface for all data sources
- Supports incremental builds for efficient updates
- Enables third-party developers to create their own integrations
- Preserves type safety and testability
2 · Proposed Solution
We will implement a plugin architecture based on Python’s Protocol pattern and entry point discovery system.
2.1 IngestorPlugin Protocol
from typing import Protocol, List, Tuple, Dict, Any, Optional
from arc_memory.schema.models import Node, Edge
class IngestorPlugin(Protocol):
def get_name(self) -> str:
"""Return a unique name for this plugin."""
...
def get_node_types(self) -> List[str]:
"""Return a list of node types this plugin can create."""
...
def get_edge_types(self) -> List[str]:
"""Return a list of edge types this plugin can create."""
...
def ingest(self, last_processed: Optional[Dict[str, Any]] = None) -> Tuple[List[Node], List[Edge], Dict[str, Any]]:
"""
Ingest data from the source and return nodes, edges, and metadata.
Args:
last_processed: Optional dictionary containing metadata from the previous run,
used for incremental ingestion.
Returns:
A tuple containing:
- List[Node]: List of nodes created from the data source
- List[Edge]: List of edges created from the data source
- Dict[str, Any]: Metadata about the ingestion process, used for incremental builds
"""
...
2.2 IngestorRegistry
class IngestorRegistry:
def __init__(self):
"""Initialize an empty registry."""
self.ingestors: Dict[str, IngestorPlugin] = {}
def register(self, ingestor: IngestorPlugin) -> None:
"""Register a plugin with the registry."""
self.ingestors[ingestor.get_name()] = ingestor
def get(self, name: str) -> Optional[IngestorPlugin]:
"""Get a plugin by name."""
return self.ingestors.get(name)
def list_plugins(self) -> List[str]:
"""List all registered plugins."""
return list(self.ingestors.keys())
def get_all(self) -> List[IngestorPlugin]:
"""Get all registered plugins."""
return list(self.ingestors.values())
2.3 Plugin Discovery
import pkg_resources
import logging
from typing import Type
logger = logging.getLogger(__name__)
def discover_plugins() -> IngestorRegistry:
"""Discover and register all available plugins."""
registry = IngestorRegistry()
# Register built-in plugins
from arc_memory.ingest.git import GitIngestor
from arc_memory.ingest.github import GitHubIngestor
from arc_memory.ingest.adr import ADRIngestor
registry.register(GitIngestor())
registry.register(GitHubIngestor())
registry.register(ADRIngestor())
# Discover and register third-party plugins
for entry_point in pkg_resources.iter_entry_points('arc_memory.plugins'):
try:
plugin_class: Type[IngestorPlugin] = entry_point.load()
registry.register(plugin_class())
logger.info(f"Loaded plugin: {entry_point.name}")
except Exception as e:
logger.warning(f"Failed to load plugin {entry_point.name}: {e}")
return registry
2.4 Integration with Build Process
def build_graph(
repo_path: Path,
output_path: Path,
incremental: bool = False,
plugins: Optional[List[str]] = None
) -> None:
"""Build the knowledge graph using available plugins."""
# Discover plugins
registry = discover_plugins()
# Filter plugins if specified
active_plugins = registry.get_all()
if plugins:
active_plugins = [p for p in active_plugins if p.get_name() in plugins]
# Load existing manifest for incremental builds
manifest = None
if incremental:
manifest = load_build_manifest()
# Initialize lists for all nodes and edges
all_nodes = []
all_edges = []
# Process each plugin
for plugin in active_plugins:
plugin_name = plugin.get_name()
logger.info(f"Processing plugin: {plugin_name}")
# Get last processed data for this plugin
last_processed = None
if manifest and incremental:
last_processed = manifest.last_processed.get(plugin_name, {})
# Run the plugin
nodes, edges, metadata = plugin.ingest(last_processed)
# Add results to the combined lists
all_nodes.extend(nodes)
all_edges.extend(edges)
# Update manifest
if manifest:
manifest.last_processed[plugin_name] = metadata
# Write to database
# ... (existing database write code)
# Update and save manifest
# ... (existing manifest update code)
3 · Alternatives Considered
3.1 Class Inheritance
We considered using a base class with abstract methods instead of a Protocol:
class BaseIngestor(ABC):
@abstractmethod
def get_name(self) -> str: ...
@abstractmethod
def ingest(self, last_processed: Optional[Dict] = None) -> Tuple[List[Node], List[Edge], Dict]: ...
Pros:
- Enforces implementation of required methods at instantiation time
- Can provide default implementations for some methods
Cons:
- Less flexible than Protocol (requires inheritance)
- More difficult to adapt existing classes
- Doesn’t work well with multiple inheritance
3.2 Function-Based Plugins
We considered a simpler approach using functions as plugins:
def register_plugin(name: str):
def decorator(func):
PLUGINS[name] = func
return func
return decorator
@register_plugin("github")
def ingest_github(last_processed=None):
# Implementation
return nodes, edges, metadata
Pros:
- Simpler implementation
- Less boilerplate
Cons:
- Less structured
- Harder to maintain type safety
- Doesn’t encapsulate related functionality
3.3 Plugin Configuration Files
We considered requiring plugins to be registered in a configuration file:
# ~/.arc/plugins.yaml
plugins:
- name: github
enabled: true
module: arc_memory.ingest.github
class: GitHubIngestor
Pros:
- More explicit control over which plugins are loaded
- Can configure plugin parameters
Cons:
- More complex setup for users
- Requires additional configuration parsing
- Less discoverable
4 · Impact
4.1 Positive Impacts
- Extensibility: Easy addition of new data sources
- Decoupling: Data source implementations are isolated from core code
- Consistency: All data sources follow the same interface
- Discoverability: Third-party plugins can be discovered automatically
- Type Safety: Protocol ensures type safety with static type checkers
4.2 Negative Impacts
- Complexity: Slightly more complex than direct function calls
- Performance: Small overhead from plugin discovery and registry lookups
- Learning Curve: Developers need to understand the plugin system
5 · Decision
We will implement the plugin architecture as described in Section 2, using:
- The
IngestorPlugin
Protocol for defining the plugin interface
- The
IngestorRegistry
class for managing plugins
- The entry point discovery mechanism for third-party plugins
- Integration with the build process to use all available plugins
This approach provides the best balance of flexibility, type safety, and ease of use.
Proposed – 2025-04-23
— Jarrod Barnes
6 · Implementation Checklist