How can I remove (non-trivial) duplicates from a VCF file?Are duplicate variants against the VCF standard?How...
How should I handle players who ignore the session zero agreement?
Why is working on the same position for more than 15 years not a red flag?
How to make ice magic work from a scientific point of view?
Is there any risk in sharing info about technologies and products we use with a supplier?
How can a large fleets maintain formation in interstellar space?
Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?
Words and Words with "ver-" Prefix
What sets the resolution of an analog resistive sensor?
Is using an 'empty' metaphor considered bad style?
False written accusations not made public - is there law to cover this?
Why would space fleets be aligned?
Square Root Distance from Integers
Can we harness potential energy?
Does every functor from Set to Set preserve products?
Why are the books in the Game of Thrones citadel library shelved spine inwards?
Boss asked me to sign a resignation paper without a date on it along with my new contract
Early credit roll before the end of the film
What is the data structure of $@ in shell?
Am I a Rude Number?
Workflow Comment popup does not show up
Do authors have to be politically correct in article-writing?
Is Krishna the only avatar among dashavatara who had more than one wife?
A Missing Symbol for This Logo
A starship is travelling at 0.9c and collides with a small rock. Will it leave a clean hole through, or will more happen?
How can I remove (non-trivial) duplicates from a VCF file?
Are duplicate variants against the VCF standard?How to read structural variant VCF?How do I carry out an ancestry/admixture test on a single VCF file?How can I extract only insertions from a VCF file?How to manipulate a reference FASTA or bam to include variants from a VCF?How to represent a deletion at position 1 in a VCF file?Where can I get the population allele frequency vcf file?How to subset samples from a VCF file?Meaning of the FORMAT fields of the VCF file coming from GIAB projectHigh Phred Quality score VCF fileAre duplicate variants against the VCF standard?
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
$endgroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
vcf variation
vcf variation
edited 1 hour ago
terdon
asked 1 hour ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
44 mins ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
$endgroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
answered 1 hour ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
44 mins ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
44 mins ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
edited 17 mins ago
answered 56 mins ago
Daniel StandageDaniel Standage
2,333329
2,333329
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
44 mins ago
add a comment |
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
44 mins ago
1
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
44 mins ago
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
44 mins ago
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown