ColabFold Downloads

File                            MD5 Hash                          Byte Size
uniref30_2302.tar.gz            7c710858a3dcadd750b50e77875bc676  102918187842
uniref30_2202.tar.gz            2c7e7de12113559a97857da01389f3c1  82977314638
uniref30_2103.tar.gz            2af35af92a9b1cf287a2845a80adf0c3  75336202245
uniref30_2103_taxonomy.tar.gz   2847cfbf9831b8a3990c016d89d2a295  1350502783
bfd_mgy_colabfold.tar.gz        72e169cccbbd007cc26f106e6b68941e  91235687604
colabfold_envdb_202108.tar.gz   fb4976e65837fb8167d46ad0e3653c9c  117965643010
pdb100_230517.fasta.gz          9ed5d01c1f8ba8c281c5d05a2e6d1466  28432889
pdb100_foldseek_230517.tar.gz   ec5f0c493532417478f01b3ac8a30c8e  19189110724
pdb70_220313.fasta.gz           f43aa975d54b49004bdc3f89191685df  21231545
pdb70_from_mmcif_220313.tar.gz  e401ff2e05a957be27de48f427134e63  28627472935
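
A downloaded archive can be checked against the MD5 hash and byte size listed above before unpacking. The following is a minimal Python sketch and not part of ColabFold; the function name verify_download and the chunk size are illustrative choices.

```python
import hashlib
import os


def verify_download(path, expected_md5, expected_size, chunk_size=1 << 20):
    """Return True if the file at `path` has the expected byte size and MD5 hash.

    Hypothetical helper for illustration; not part of ColabFold.
    """
    # Cheap check first: a truncated download fails here without hashing.
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that very large archives never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

For example, verify_download("uniref30_2302.tar.gz", "7c710858a3dcadd750b50e77875bc676", 102918187842) should return True for an intact copy of that archive.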

Reference

Mirdita M., Schütz K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger M. ColabFold: Making protein folding accessible to all. Nature Methods, doi: 10.1038/s41592-022-01488-1.

Database information

ColabFold databases are MMseqs2 expandable profile databases used to generate diverse multiple sequence alignments for protein structure prediction. They are the backend of our ColabFold MMseqs2 searches. Here you can download three databases: (1) UniRef30, (2) BFD/MGnify and (3) ColabFold DB.

(1) UniRef30 is a 30% sequence identity clustered database based on UniRef100.
(2) BFD/MGnify is a combination of BFD and MGnify (2019_05). We merged both databases by searching the MGnify sequences against the BFD cluster representative sequences. Each MGnify sequence with a sequence identity higher than 30% and a local alignment that covers at least 90% of its length is assigned to the corresponding BFD cluster. All remaining sequences are clustered at 30% sequence identity and 90% coverage (--min-seq-id 0.3 -c 0.9 --cov-mode 1 -s 3) and merged with the BFD clusters, resulting in 182 million clusters. For each cluster we keep only the 10 most diverse sequences (filterresult --diff 10).
(3) ColabFold DB is constructed similarly to BFD/MGnify. It contains BFD/MGnify, MetaEuk (Levy Karin et al.), SMAG (Delmont et al.), TOPAZ (Alexander et al.), MGV (Nayfach et al.), GPD (Camarillo-Guerrero et al.) and MetaClust2.
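
The assignment rule described in (2) can be written as a simple predicate. This is only an illustrative Python sketch of the stated thresholds, reading the identity threshold as "higher than 30%"; the actual merge was performed with MMseqs2, and the function and parameter names here are hypothetical.

```python
def assign_to_bfd_cluster(seq_identity, aligned_length, sequence_length,
                          min_identity=0.3, min_coverage=0.9):
    """Decide whether a Mgnify sequence joins the BFD cluster it aligned to.

    seq_identity    -- fraction of identical residues in the local alignment (0..1)
    aligned_length  -- residues of the Mgnify sequence covered by the alignment
    sequence_length -- full length of the Mgnify sequence
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    coverage = aligned_length / sequence_length
    # Identity must be higher than 30%; coverage must be at least 90%.
    return seq_identity > min_identity and coverage >= min_coverage
```

Sequences for which this returns False correspond to the "remaining sequences" that are clustered separately.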

Setup ColabFold Search

To set up the ColabFold MMseqs2 search, you need an MMseqs2 version of commit or newer. Convert the UniRef30 and either the BFD/MGnify or ColabFold DB using tsv2exprofiledb.

Build database


wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh
./setup_databases.sh database/

Run the search script

We provide a user-friendly Python solution to search against the databases:


pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
pip install -q "jax[cuda]>=0.3.8,<0.4" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
colabfold_search input_sequences.fasta database/ msas

The results are written to the output folder given as the last argument (msas in the command above). It contains a uniref.a3m and a bfd.mgnify30.metaeuk30.smag30.a3m file. The input FASTA file can contain multiple queries; in the result files, each query is separated by a null byte.
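
To work with the alignments of individual queries, a result file can be split at the null bytes. The sketch below is a minimal Python example under two assumptions: the entries in the .a3m file are delimited by a single null byte, and the file is small enough to read into memory. The function name split_a3m_entries is hypothetical.

```python
def split_a3m_entries(path):
    """Return the per-query alignments stored in a null-byte separated a3m file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # Entries are delimited by b"\x00"; drop empty pieces such as the one
    # produced by a trailing separator.
    return [entry.decode("utf-8") for entry in raw.split(b"\x00") if entry.strip()]
```

Each returned string can then be written to its own .a3m file, one per query.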

Searches against the ColabFoldDB can be done in two different modes:

(1) Batch searches with many sequences against the ColabFoldDB require a machine with approx. 128GB RAM. The search should be performed on the same machine that called setup_databases.sh, since the database index size is adjusted to the main memory size. To search on computers with less main memory, delete the index by removing all .idx files; this forces MMseqs2 to create an index on the fly in memory. MMseqs2 is optimized for large input sequence sets.
(2) Single-query searches require the full index (the .idx files) to be kept in memory. This can be done, e.g., with vmtouch. Thus, this type of search requires a machine with at least 768GB RAM for the ColabFoldDB. If the index is in memory, use the --db-load-mode 3 parameter in colabfold_search to avoid index loading overhead.
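
The choice between the two modes can be sketched in Python as follows. This is an illustration and not part of ColabFold: the helper names are hypothetical, the RAM thresholds are the approximate figures quoted above, and matching every file whose name contains ".idx" is an assumption about how the index files are named.

```python
from pathlib import Path

BATCH_RAM_GB = 128         # approximate RAM for batch searches (from the text)
SINGLE_QUERY_RAM_GB = 768  # RAM needed to hold the full index (from the text)


def suggest_mode(ram_gb):
    """Map available main memory (in GB) to one of the search setups described above."""
    if ram_gb >= SINGLE_QUERY_RAM_GB:
        return "single-query"
    if ram_gb >= BATCH_RAM_GB:
        return "batch"
    # Below ~128 GB: remove the .idx files so the index is built on the fly.
    return "batch-without-index"


def find_index_files(db_dir):
    """List files in db_dir whose name contains '.idx' (assumed index naming)."""
    return sorted(p for p in Path(db_dir).iterdir()
                  if p.is_file() and ".idx" in p.name)


def build_search_command(input_fasta, db_dir, out_dir, index_in_memory=False):
    """Assemble the colabfold_search call as an argument list for subprocess.run."""
    command = ["colabfold_search"]
    if index_in_memory:
        # Only useful when the .idx files are already resident in memory.
        command += ["--db-load-mode", "3"]
    return command + [str(input_fasta), str(db_dir), str(out_dir)]
```

On a machine with less memory, the files returned by find_index_files("database/") are the ones to delete before searching; review the list before removing anything.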

License

All files are available under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.