SNP_calling rule crashes for samples with more than 131072 contigs per sample due to a limitation in LoFreq #102

DennisSchmitz · 2019-10-25T14:54:16Z

This issue was emailed to me by @RozemarijnVanDerPlaats.

One specific sample kept crashing in the SNP_calling step, see DRMAA log below:

Error in rule SNP_calling:
    jobid: 0
    output: data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta.fai, data/scaffolds_filtered/4_S4_unfiltered.vcf, data/scaffolds_filtered/4_S4_filtered.vcf, data/scaffolds_filtered/4_S4_filtered.vcf.gz, data/scaffolds_filtered/4_S4_filtered.vcf.gz.tbi
    log: logs/SNP_calling_4_S4.log
    conda-env: /mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965

RuleException:
CalledProcessError in line 366 of /mnt/scratch_dir/plaatvdr/Jovian/Snakefile:
Command 'source /mnt/miniconda/bin/activate '/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965'; set -euo pipefail;  samtools faidx -o data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta.fai data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta > logs/SNP_calling_4_S4.log 2>&1
lofreq call-parallel -d 20000 --no-default-filter --pp-threads 12 -f data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta -o data/scaffolds_filtered/4_S4_unfiltered.vcf data/scaffolds_filtered/4_S4_sorted.bam >> logs/SNP_calling_4_S4.log 2>&1
lofreq filter -a 0.05 -i data/scaffolds_filtered/4_S4_unfiltered.vcf -o data/scaffolds_filtered/4_S4_filtered.vcf >> logs/SNP_calling_4_S4.log 2>&1
bgzip -c data/scaffolds_filtered/4_S4_filtered.vcf 2>> logs/SNP_calling_4_S4.log 1> data/scaffolds_filtered/4_S4_filtered.vcf.gz
tabix -p vcf data/scaffolds_filtered/4_S4_filtered.vcf.gz >> logs/SNP_calling_4_S4.log 2>&1' returned non-zero exit status 1.
  File "/mnt/scratch_dir/plaatvdr/Jovian/Snakefile", line 366, in __rule_SNP_calling
  File "/home/plaatvdr/envs/Jovian_master/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job SNP_calling since they might be corrupted:
data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta.fai
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

See the log file below:

INFO [2019-10-25 14:46:08,446]: Using 12 threads with following basic args: lofreq call -d 20000 --no-default-filter -f data/scaffolds_filtered/4_S4_scaffolds_ge500nt.fasta data/scaffolds_filtered/4_S4_sorted.bam

INFO [2019-10-25 14:46:10,903]: Adding 157086 commands to mp-pool
Traceback (most recent call last):
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/bin/lofreq2_call_pparallel.py", line 746, in <module>
    main()
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/bin/lofreq2_call_pparallel.py", line 669, in main
    "##source=%s" % ' '.join(sys.argv))
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/bin/lofreq2_call_pparallel.py", line 174, in concat_vcf_files
    subprocess.check_call(cmd)
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/lib/python3.6/subprocess.py", line 286, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/lib/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/mnt/scratch_dir/plaatvdr/Jovian/.snakemake/conda/e0281965/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'lofreq'

Searching for this error on LoFreq's issues page turned up CSB5/lofreq#79. Apparently, LoFreq has a hardcoded limit and only accepts 131072 contigs per sample. When I checked the number of trimmed scaffolds in this sample, it was 157086 contigs, which matches the 157086 commands added to the mp-pool in the log above. So that is the cause of the problem.

The solution would be to write a checker that splits up files with more than 131072 contigs and later merges the results back together. But it seems like such a corner case that I'm giving it low priority.
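A minimal sketch of what such a checker could look like, assuming the limit from the issue title (131072) and a plain FASTA input. The names `MAX_CONTIGS`, `read_fasta` and `batch_contigs` are hypothetical and not part of Jovian or LoFreq. How each batch is then handed to LoFreq, and how the per-batch VCF files are merged afterwards, is deliberately left open here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Assumption: limit taken from the issue title, not verified against LoFreq's source.
MAX_CONTIGS = 131072

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_contigs(records: Iterable[Record],
                  max_contigs: int = MAX_CONTIGS) -> Iterator[List[Record]]:
    """Group records into consecutive batches of at most max_contigs each."""
    if max_contigs < 1:
        raise ValueError("max_contigs must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, max_contigs))
        if not batch:
            return
        yield batch
```

With the 157086 contigs of this sample and the assumed limit, `batch_contigs` would yield two batches (131072 and 26014 contigs), so samples below the limit pass through as a single batch and are unaffected.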

A "work-around" would be to remove such samples from your analysis, at least then the entire Jovian analysis will finish. Another "work-around" would be to tweak the filtering parameters such that the number of contigs drops below the LoFreq limit, e.g. by increasing the minlen parameter (and thus filtering away more scaffolds).
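To estimate how far minlen needs to be raised, one could count how many contigs survive at each candidate threshold. The sketch below is only an illustration: it assumes contig lengths are read from the second tab-separated column of a `samtools faidx` index (`.fai`), it again takes the limit from the issue title, and the helper names are hypothetical rather than part of Jovian.

```python
from typing import Iterable, List, Optional, Sequence

# Assumption: limit taken from the issue title, not verified against LoFreq's source.
MAX_CONTIGS = 131072


def lengths_from_fai(lines: Iterable[str]) -> List[int]:
    """Read contig lengths from .fai lines (column 2 holds the length)."""
    return [int(line.split("\t")[1]) for line in lines if line.strip()]


def count_at_least(lengths: Sequence[int], minlen: int) -> int:
    """Count contigs whose length is at least minlen."""
    return sum(1 for length in lengths if length >= minlen)


def smallest_minlen(lengths: Sequence[int],
                    candidates: Iterable[int],
                    max_contigs: int = MAX_CONTIGS) -> Optional[int]:
    """Return the smallest candidate minlen that keeps the contig count
    at or below max_contigs, or None if no candidate is strict enough."""
    for minlen in sorted(candidates):
        if count_at_least(lengths, minlen) <= max_contigs:
            return minlen
    return None
```

For example, `smallest_minlen(lengths, [500, 750, 1000, 1500])` returns the least aggressive of those thresholds that would still bring a sample under the assumed limit, which keeps as many scaffolds as possible in the analysis.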

Please, if you also encounter this error, mention it in this thread so I can reevaluate the priority.


DennisSchmitz · 2019-10-29T13:22:23Z

Other samples in @RozemarijnVanDerPlaats's run have the same problem. It seems to happen in environmental samples (e.g. surface water), where it makes sense that a great many organisms are so diluted that they do not generate enough reads to assemble into bigger scaffolds.

This has never been a problem in the hundreds of clinical samples processed thus far, nor do I expect it to be in the future. Still, it's sloppy and hinders broader usage.

I've asked for the data so I can test a solution when I've got the time for it.

DennisSchmitz added the bug (Something isn't working) and wontfix (This will not be worked on) labels Oct 25, 2019

DennisSchmitz self-assigned this Oct 25, 2019

DennisSchmitz pinned this issue Dec 5, 2019

DennisSchmitz mentioned this issue May 8, 2020

FAQ: LoFreq gives error "OSError: [Errno 7] Argument list too long: lofreq #143

Open

DennisSchmitz unpinned this issue Sep 2, 2020
