Build Commands
The Arc Memory CLI provides commands for building the knowledge graph from Git commits, GitHub PRs and issues, and ADRs.
Related Documentation:
- Authentication Commands - Authenticate before building to include GitHub data
- Doctor Commands - Verify your build status and graph statistics
- Trace Commands - Trace history after building your graph
- Building Graphs Examples - Detailed examples of building graphs
Overview
The build process discovers and executes plugins to ingest data from various sources, creates nodes and edges in the knowledge graph, and saves the result to a SQLite database. It supports both full and incremental builds, allowing for efficient updates to the graph.
Commands
arc build
Build the knowledge graph from Git, GitHub, and ADRs.
This is the main command for building the knowledge graph. It processes data from all available plugins and creates a SQLite database containing the knowledge graph.
Options
- `--repo`, `-r` TEXT: Path to the Git repository (default: current directory).
- `--output`, `-o` TEXT: Path to the output database file (default: `~/.arc/graph.db`).
- `--max-commits` INTEGER: Maximum number of commits to process (default: 5000).
- `--days` INTEGER: Maximum age of commits to process, in days (default: 365).
- `--incremental`: Only process new data since the last build (default: false).
- `--pull`: Pull the latest CI-built graph (not implemented yet).
- `--token` TEXT: GitHub token to use for API calls.
- `--debug`: Enable debug logging.
Examples
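The following commands cover the most common invocations (the repository and database paths shown are illustrative):

```shell
# Full build of the current repository into the default database (~/.arc/graph.db)
arc build

# Build a specific repository into a custom database file
arc build --repo /path/to/repo --output /path/to/graph.db

# Limit the build to recent history to keep build times down
arc build --max-commits 1000 --days 90

# Include GitHub PRs and issues using an explicit token
arc build --token "$GITHUB_TOKEN"
```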
Build Process Flow
The build process follows these steps:
1. Initialization:
   - Ensure the output directory exists
   - Check that the repository is a Git repository
   - Load the existing manifest for incremental builds
   - Initialize the database
2. Plugin Discovery:
   - Discover and register plugins using the plugin registry
   - Plugins are discovered using entry points
3. Data Ingestion. For each plugin:
   - Get the last processed data (for incremental builds)
   - Call the plugin's `ingest` method with the appropriate parameters
   - Collect nodes and edges from the plugin
4. Database Operations:
   - Write all nodes and edges to the database
   - Get node and edge counts
   - Compress the database
5. Manifest Creation:
   - Create a build manifest with metadata about the build
   - Save the manifest for future incremental builds
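The flow above can be sketched in Python. This is a simplified illustration, not Arc Memory's actual implementation: `DummyPlugin`, the table schema, and the manifest shape are all hypothetical stand-ins.

```python
import sqlite3


class DummyPlugin:
    """Stand-in for a discovered ingest plugin (illustrative only)."""

    name = "git"

    def ingest(self, repo_path, last_processed=None):
        # A real plugin would walk commits, PRs, or ADRs; here we return fixed data.
        nodes = [{"id": "commit:abc123", "type": "commit"}]
        edges = [{"src": "commit:abc123", "dst": "file:README.md", "rel": "MODIFIES"}]
        metadata = {"last_commit": "abc123"}
        return nodes, edges, metadata


def build(repo_path, db_path=":memory:", plugins=(DummyPlugin(),), manifest=None):
    """Mirror the documented flow: ingest per plugin, write nodes and edges,
    then return a fresh manifest for the next incremental build."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, type TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT, rel TEXT)")
    new_manifest = {"plugins": {}}
    for plugin in plugins:
        # For incremental builds, hand each plugin its last processed data.
        last = (manifest or {}).get("plugins", {}).get(plugin.name)
        nodes, edges, meta = plugin.ingest(repo_path, last_processed=last)
        conn.executemany("INSERT OR REPLACE INTO nodes VALUES (?, ?)",
                         [(n["id"], n["type"]) for n in nodes])
        conn.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                         [(e["src"], e["dst"], e["rel"]) for e in edges])
        new_manifest["plugins"][plugin.name] = meta
    conn.commit()
    node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    edge_count = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
    return new_manifest, node_count, edge_count
```

The key design point is that the builder itself knows nothing about Git or GitHub; all source-specific logic lives behind each plugin's `ingest` method, and the manifest is the only state carried between builds.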
Incremental Builds
Incremental builds only process new data since the last build, making them much faster than full builds. The process works as follows:
- Load the existing build manifest
- Pass the last processed data to each plugin
- Plugins use this data to determine what’s new
- Only new nodes and edges are added to the database
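A minimal sketch of how a plugin might use the saved manifest to skip already-processed history (the manifest shape and the `last_commit` field are assumptions for illustration, not Arc Memory's actual format):

```python
def select_new_commits(all_commits, manifest):
    """Return only the commits after the last one recorded in the build manifest."""
    last = (manifest or {}).get("last_commit")
    if last is None:
        return list(all_commits)  # no manifest yet: fall back to a full build
    # Commits are assumed ordered oldest-to-newest; skip through the last processed one.
    idx = all_commits.index(last)
    return all_commits[idx + 1:]
```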
To run an incremental build:
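```shell
arc build --incremental
```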
Performance Considerations
Build Times
The time required to build a knowledge graph depends on several factors:
| Repository Size | Commits | PRs/Issues | Estimated Full Build Time | Incremental Build Time |
|---|---|---|---|---|
| Small | <500 | <100 | 10-30 seconds | <1 second |
| Medium | 500-5000 | 100-1000 | 1-5 minutes | 1-3 seconds |
| Large | 5000+ | 1000+ | 5-15 minutes | 3-10 seconds |
| Very Large | 10000+ | 5000+ | 15-60 minutes | 10-30 seconds |
These estimates assume:
- A modern computer (quad-core CPU, 8GB+ RAM)
- Good network connection for GitHub API calls
- GitHub API rate limits not being hit
Resource Requirements
- CPU: The build process is multi-threaded and benefits from multiple cores
  - Minimum: Dual-core CPU
  - Recommended: Quad-core CPU or better
- Memory:
  - Minimum: 4GB RAM
  - Recommended: 8GB RAM
  - Large repositories (10000+ commits): 16GB RAM
- Disk Space:
  - Small repositories: ~10-50MB
  - Medium repositories: ~50-200MB
  - Large repositories: ~200MB-1GB
  - Very large repositories: 1GB+
- Network:
  - GitHub API calls require internet connectivity
  - Bandwidth requirements are modest, but latency can affect build times
Optimizing Build Performance
- Use Incremental Builds: After the initial build, always use `--incremental` for faster updates
- Limit Scope: Use `--max-commits` and `--days` to limit the data processed
- GitHub Token: Ensure you're authenticated to avoid rate limits
- Local Network: Build on a fast, low-latency network connection
- SSD Storage: Using an SSD rather than an HDD can significantly improve performance
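These tips can be combined in a single invocation (the token is read from an environment variable here; set it however suits your environment):

```shell
# Fast, authenticated, scope-limited incremental update
arc build --incremental --max-commits 1000 --days 90 --token "$GITHUB_TOKEN"
```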
Troubleshooting
If you encounter issues during the build process:
- GitHub Rate Limiting: If you hit GitHub API rate limits, provide a token with higher limits or wait and try again.
- Large Repositories: For very large repositories, use the `--max-commits` and `--days` options to limit the amount of data processed.
- Debug Mode: Run with the `--debug` flag to see detailed logs: `arc build --debug`
- Database Corruption: If the database becomes corrupted, delete it and run a full build again.
- Plugin Errors: If a specific plugin fails, check its error message and ensure it has the necessary permissions and configuration.
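For the database-corruption case, deleting the database and rebuilding looks like this (the path shown assumes the default `--output` location; adjust it if you built to a custom path):

```shell
# Remove the corrupted graph database and run a fresh full build
rm ~/.arc/graph.db
arc build
```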