Here are my 12 guidelines for data exploration and analysis with the right attitude for discovery: 1. You never really finish analyzing a dataset. You just decide to stop and move on at some point, leaving some things undiscovered. 🧵
In the LLM-science discussion, I see a common misconception that science is a thing you do and that writing about it is separate and can be automated. I’ve written over 300 scientific papers and can assure you that science writing can’t be separated from science doing. Why? 1/18
1. Inspired by @BiophysicalFrog insightful comments about the lack of training in science in how to be a PI I wanted to share some thoughts about something that helped me a lot over the years in my own learning journey. This is a thread about PI Superpowers!
$ time bwa mem -t 32 -SP5M bwa-indexes/UCSC-hg38 \
SRR5266584_1.fastq.gz SRR5266584_2.fastq.gz \
> mapped.bwa.sort.bam
... (中略)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (1778, 3622, 6305)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 15359)
[M::mem_pestat] mean and std.dev: (4152.72, 2755.39)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 19886)
[M::mem_process_seqs] Processed 604710 reads in 302.831 CPU sec, 9.743 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 32 -SP5M bwa-indexes/UCSC-hg38 SRR5266584_1.fastq.gz SRR5266584_2.fastq.gz
[main] Real time: 1874.166 sec; CPU: 56598.868 sec
________________________________________________________
Executed in 31.24 mins fish external
usr time 928.66 mins 595.00 micros 928.66 mins
sys time 14.65 mins 231.00 micros 14.65 mins
$ time chromap --preset hic --remove-pcr-duplicates \
-t 32 -x chromap-indexes/UCSC-hg38 -r hg38.fa \
-1 SRR5266584_1.fastq.gz -2 SRR5266584_2.fastq.gz \
-o mapped.chromap.pairs
Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 32
Analyze bulk data.
Won't try to remove adapters on 3'.
Will remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in pairs format.
Reference file: hg38.fa
Index file: chromap-indexes/UCSC-hg38
1th read 1 file: SRR5266584_1.fastq.gz
1th read 2 file: SRR5266584_2.fastq.gz
Output file: mapped.chromap.pairs
Loaded all sequences successfully in 28.22s, number of sequences: 25, number of bases: 3088286401.
Kmer size: 17, window size: 7.
Lookup table size: 392946789, occurrence table size: 445474231.
Loaded index successfully in 112.23s.
Mapped 500000 read pairs in 5.55s.
(中略)
Mapped 500000 read pairs in 1.56s.
Mapped 100349 read pairs in 0.56s.
Mapped all reads in 304.95s.
Number of reads: 120200698.
Number of mapped reads: 108983560.
Number of uniquely mapped reads: 100363412.
Number of reads have multi-mappings: 8620148.
Number of candidates: 4829748453.
Number of mappings: 108983560.
Number of uni-mappings: 100363412.
Number of multi-mappings: 8620148.
Sorted, deduped and outputed mappings in 57.58s.
# uni-mappings: 53267577, # multi-mappings: 868789, total: 54136366.
Number of output mappings (passed filters): 47034561
Total time: 508.85s.
________________________________________________________
Executed in 511.87 secs fish external
usr time 100.27 mins 475.00 micros 100.27 mins
sys time 1.93 mins 234.00 micros 1.93 mins
$ time chromap --preset chip -t 24 \
$ -x chromap-UCSC-hg38 -r hg38.fa \
$ -1 SRR227359.fastq.gz -o SRR227359.chromap.bed
Preset parameters for ChIP-seq are used.
Start to map reads.
Parameters: error threshold: 8, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 2000, MAPQ-threshold: 30, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 24
Analyze bulk data.
Won't try to remove adapters on 3'.
Will remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Output mappings in BED/BEDPE format.
Reference file: hg38.fa
Index file: chromap-UCSC-hg38
1th read 1 file: SRR227359.fastq.gz
Output file: SRR227359.chromap.bed
Loaded all sequences successfully in 50.49s, number of sequences: 25, number of bases: 3088286401.
Kmer size: 17, window size: 7.
Lookup table size: 392946789, occurrence table size: 445474231.
Loaded index successfully in 174.56s.
Loaded 500000 reads in 1.01s.
(中略)
Loaded 314483 reads in 0.53s.
Mapped in 0.56s.
No more reads.
Mapped in 0.18s.
Mapped all reads in 30.17s.
Number of reads: 21814483.
Number of mapped reads: 12508635.
Number of uniquely mapped reads: 11368866.
Number of reads have multi-mappings: 1139769.
Number of candidates: 227477151.
Number of mappings: 12508635.
Number of uni-mappings: 11368866.
Number of multi-mappings: 1139769.
Sorted, deduped and outputed mappings in 8.99s.
# uni-mappings: 10833419, # multi-mappings: 1020796, total: 11854215.
Number of output mappings (passed filters): 9211499
Total time: 86.82s.
real 1m32.913s
user 4m12.370s
sys 0m27.555s
$ time chromap --preset chip -t 24 \
$ -x chromap-UCSC-hg38 -r hg38.fa \
$ -1 SRR227359.fastq.gz --SAM -o SRR227359.chromap.sam
Preset parameters for ChIP-seq are used.
Start to map reads.
(中略)
Mapped all reads in 62.61s.
Number of reads: 21814483.
Number of mapped reads: 12508635.
Number of uniquely mapped reads: 11368866.
Number of reads have multi-mappings: 1139769.
Number of candidates: 227477151.
Number of mappings: 12508635.
Number of uni-mappings: 11368866.
Number of multi-mappings: 1139769.
Sorted, deduped and outputed mappings in 31.94s.
# uni-mappings: 10859912, # multi-mappings: 1023343, total: 11883255.
Number of output mappings (passed filters): 9236028
Total time: 109.61s.
real 2m4.987s
user 6m24.888s
sys 0m48.419s
$ time chromap -l 2000 --low-mem --BED -t 24 \
$ -x chromap-UCSC-hg38 -r hg38.fa \
$ -1 SRR227359.fastq.gz --SAM -o SRR227359.chromap.include_dup.sam
Start to map reads.
Parameters: error threshold: 8, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 2000, MAPQ-threshold: 30, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 24
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
(中略)
Mapped all reads in 74.75s.
Number of reads: 21814483.
Number of mapped reads: 12508635.
Number of uniquely mapped reads: 11368866.
Number of reads have multi-mappings: 1139769.
Number of candidates: 227477151.
Number of mappings: 12508635.
Number of uni-mappings: 11368866.
Number of multi-mappings: 1139769.
Sorted and outputed mappings in 42.73s.
# uni-mappings: 11368866, # multi-mappings: 1139769, total: 12508635.
Number of output mappings (passed filters): 9685899
Total time: 144.08s.
real 2m29.591s
user 8m41.938s
sys 1m0.581s
$ time chromap -l 2000 --low-mem -q 0 --BED -t 24 \
$ -x chromap-UCSC-hg38 -r hg38.fa \
$ -1 SRR227359.fastq.gz --SAM -o SRR227359.chromap.include_dup.q0.sam
Mapped all reads in 78.28s.
Number of reads: 21814483.
Number of mapped reads: 12508635.
Number of uniquely mapped reads: 11368866.
Number of reads have multi-mappings: 1139769.
Number of candidates: 227477151.
Number of mappings: 12508635.
Number of uni-mappings: 11368866.
Number of multi-mappings: 1139769.
Sorted and outputed mappings in 47.17s.
# uni-mappings: 11368866, # multi-mappings: 1139769, total: 12508635.
Number of output mappings (passed filters): 12508635
Total time: 188.37s.
real 3m11.691s
user 8m59.719s
sys 1m8.940s