ColabFold databases are MMseqs2 expandable profile databases used to generate diverse multiple sequence alignments for protein structure prediction. They are the backend of our ColabFold MMseqs2 searches. Here you can download three databases: (1) UniRef30, (2) BFD/MGnify and (3) ColabFold DB.
To set up the ColabFold MMseqs2 search, you need an MMseqs2 version of commit or newer. Convert the UniRef30 and either the BFD/MGnify or ColabFold DB, then install ColabFold and run a search, using:
wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh
./setup_databases.sh database/
pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
pip install -q "jax[cuda]>=0.3.8,<0.4" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
colabfold_search input_sequences.fasta database/ msas
The results will be in the output folder (msas in the command above). It contains a uniref.a3m and a bfd.mgnify30.metaeuk30.smag30.a3m file. The query FASTA file can contain multiple queries. In the output files, the results for each query are separated by a null byte.
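Because the per-query results share one file, it can be handy to split them apart before further processing. The following is a minimal sketch, assuming the null byte separates the per-query entries in the output A3M files. The function name split_a3m and the numbered output file names are hypothetical, and the sketch assumes the byte 0x01 never occurs in the A3M content, since it is used as a temporary separator.

```shell
# Hypothetical helper: split a null-byte-separated A3M file into one file per query.
# Usage: split_a3m INPUT.a3m OUTPUT_DIR
# Assumption: the A3M content never contains the byte 0x01. It is used as a
# temporary record separator because not every awk accepts a null byte as RS.
split_a3m() {
    input=$1
    outdir=$2
    mkdir -p "$outdir"
    tr '\000' '\001' < "$input" |
        awk -v RS='\001' -v dir="$outdir" '
            BEGIN { n = 0 }
            length($0) > 0 {
                # Write each non-empty entry to OUTPUT_DIR/0.a3m, 1.a3m, ...
                f = dir "/" n ".a3m"
                printf "%s", $0 > f
                close(f)
                n++
            }
            # Report how many entries were written
            END { print n }'
}
```

For example, split_a3m msas/uniref.a3m msas/uniref_split would write one numbered file per query and print the number of entries found.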
Searches against the ColabFoldDB can be done in two different modes:
(1) Batch searches with many sequences against the ColabFoldDB require a machine with approx. 128GB RAM. The search should be performed on the same machine that called setup_databases.sh, since the database index size is adjusted to the main memory size. To search on computers with less main memory, delete the index by removing all .idx files. This will force MMseqs2 to create an index on the fly in memory. MMseqs2 is optimized for large sets of input sequences.
(2) Single-query searches require the full index (the .idx files) to be kept in memory. This can be done, e.g., by using vmtouch. Thus, this type of search requires a machine with at least 768GB RAM for the ColabFoldDB. If the index is in memory, use the --db-load-mode 3 parameter in colabfold_search to avoid index loading overhead.
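The choice between the two modes can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names and mode labels are hypothetical, the RAM thresholds are the figures quoted above, the database/ path and file names are taken from the commands earlier in this text, and the vmtouch flags (-t to touch pages into memory, -l to lock them, -d to run as a daemon) are one plausible way to keep the index resident. The steps are printed rather than executed so they can be reviewed first.

```shell
# Hypothetical helper: pick a search mode from the RAM figures quoted in the text.
# Usage: choose_search_mode MEMORY_IN_GB
choose_search_mode() {
    mem_gb=$1
    if [ "$mem_gb" -ge 768 ]; then
        echo "single-query"      # full index can be kept in memory
    elif [ "$mem_gb" -ge 128 ]; then
        echo "batch"             # use the precomputed index on disk
    else
        echo "batch-no-index"    # remove .idx files, index is built on the fly
    fi
}

# Print (do not run) the steps for a given mode.
# Assumption: the .idx files sit directly inside database/.
print_steps() {
    case "$1" in
        single-query)
            echo "vmtouch -t -l -d database/*.idx"
            echo "colabfold_search --db-load-mode 3 input_sequences.fasta database/ msas"
            ;;
        batch)
            echo "colabfold_search input_sequences.fasta database/ msas"
            ;;
        batch-no-index)
            echo "rm database/*.idx"
            echo "colabfold_search input_sequences.fasta database/ msas"
            ;;
    esac
}
```

For instance, print_steps "$(choose_search_mode 64)" lists the index removal followed by a plain colabfold_search call.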
All files are available under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.