For research purposes, I want to mirror the GEO directory of the NCBI FTP site. These data are huge: more than 1,000 annotation platforms files and more than 20,000 GSE files! Platforms files are updated each month and GSE files on a daily basis. As you may imagine, I quickly forgot the idea of developping some PERL programs to achieve this task.
I first try to use a software dedicated to the mirroring of scientific databases: BioMaj. With the annotation platforms files, BioMaj works fine: files were quickly downloaded and inserted in SQL databases with the management of post-process scripts included in BioMaj. But, with the GSE files, I think I reached the limit of BioMaj, because of the structure of the GSE datafiles. Indeed, the « seriesmatrix » directory is organized in many subdirectories (almost 20,000 in which there are at least one file) and BioMaj spent more than 6 hours only to list the directory content!!! Then, the downloading and the management of the files (as BioMaj handles files versioning) were quite long, too! More problematic is that BioMaj freezed / halted / stopped during the update process. After a while, I finally try another way to make a local mirror: using classic unix tools, as… lftp, which has a mirroring and a multi-threading features!
Using lftp is quite simple, I found some usefull shell scripts here, which I’ve modified:
#!/bin/bash HOST="ftp.ncbi.nih.gov" USER="anonymous" PASS="anonymous" LOG_dir="/data/GEO/logs" LCD="/data/GEO/platforms" RCD="/pub/geo/DATA/annotation/platforms" DATE=`date '+%Y-%m-%d_%H%M'` lftp -c "set ftp:list-options -a; open ftp://$USER:$PASS@$HOST; lcd $LCD; cd $RCD; mirror --verbose --exclude-glob old/ --parallel=10 --log="$LOG_dir/GPL_$DATE" "
This script works with the GSE datafiles and it’s relatively fast (about 45 min to get the directory listing and less than one night (!) to download the files.