snsxt package¶

Subpackages¶

Submodules¶

snsxt.cleanup module¶

Functions for cleaning up after an analysis is finished

snsxt.cleanup.analysis_complete(analysis)[source]¶

Actions to take after an analysis is done

Parameters:	analysis (SnsWESAnalysisOutput) – object representing the output of an sns wes analysis pipeline, on which to run downstream analysis tasks

snsxt.cleanup.save_configs(analysis_dir)[source]¶

Saves the global configs object to a YAML file in the analysis dir

Parameters:	analysis_dir (str) – path to a directory to hold the analysis output

Notes

Some config items are added or modified during program run time, so final configs may not exactly match starting configs set in external config YAML files

snsxt.job_management module¶

Functions for custom management of compute cluster qsub jobs

snsxt.job_management.background_jobs = []¶: If an analysis task generated qsub jobs, but did not wait for them to finish, they will be captured in this list and will be monitored to completion when run_tasks finishes running all tasks. This way, the program will not exit until all jobs created have finished.

snsxt.job_management.kill_background_jobs()[source]¶: Kills all jobs in the background_jobs list

snsxt.job_management.monitor_validate_background_jobs()[source]¶: Monitors the global background_jobs until completion, then validates their completion status.

snsxt.job_management.monitor_validate_jobs(jobs)[source]¶

Monitors a list of qsub jobs until completion, then validates their completion status.

Parameters:	jobs (list) – a list of `qsub.Job` objects
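
The monitor-then-validate pattern described here can be sketched roughly as follows. This is only an illustration: the `DemoJob` class and its `is_running()` and `completed_ok()` methods are hypothetical stand-ins and not the actual `qsub.Job` interface, and the polling interval is an assumed value.

```python
import time


class DemoJob:
    """Hypothetical stand-in for a cluster job; not the real qsub.Job API."""

    def __init__(self, job_id, polls_until_done=2, succeeded=True):
        self.job_id = job_id
        self._polls_left = polls_until_done
        self._succeeded = succeeded

    def is_running(self):
        # Pretend the job finishes after a fixed number of status checks.
        if self._polls_left > 0:
            self._polls_left -= 1
            return True
        return False

    def completed_ok(self):
        return self._succeeded


def monitor_validate_jobs_sketch(jobs, poll_seconds=0.01):
    """Wait until no job is running, then return the jobs that failed validation."""
    pending = list(jobs)
    while pending:
        # Keep only the jobs that still report a running state.
        pending = [job for job in pending if job.is_running()]
        if pending:
            time.sleep(poll_seconds)
    return [job for job in jobs if not job.completed_ok()]
```

Returning the failed jobs, rather than raising immediately, leaves the caller free to decide whether a failure should halt the program.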

snsxt.mail module¶

Sends email output of the pipeline results

snsxt.mail.check_default_address(address, server, default_key='__self__')[source]¶

Checks if the provided address matches the default_key, and if so, returns a default email address made from the username of the user running the program + the server.

Parameters:	address (str) – email address(es) in the format 'email1@server.com,email2@server.com'
	server (str) – email server to use for a default email address
	default_key (str) – value to use for recognizing when a default address should be returned
Returns:	either the original `address` string, or an email address composed of the user’s system name + `server`
Return type:	str
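
A minimal sketch of this behavior, assuming the default address takes the form `username@server` and that the username comes from `getpass.getuser()`. The function name `check_default_address_sketch` is hypothetical, and the real implementation may obtain the username differently.

```python
import getpass


def check_default_address_sketch(address, server, default_key="__self__"):
    """Return `address` unchanged unless it equals `default_key`."""
    if address == default_key:
        # Assumed format: <system username>@<server>
        return "{0}@{1}".format(getpass.getuser(), server)
    return address
```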

snsxt.mail.email_error_output(message_file, *args, **kwargs)[source]¶

Sends an email in the event that errors occurred during the analysis.

Parameters:	message_file (str) – path to a file to use as the body of the email, typically the program’s log file

Keyword Arguments:

subject_line (str) – the subject line that should be used for the email
recipient_list (str) – the recipients for the email, in the format `recipient_list = "user1@server.com,user2@server.com"`

snsxt.mail.email_files = []¶

This list should contain file paths output by analysis tasks for inclusion as email attachments at the end of a successful analysis pipeline. It should be accessed by other parts of the program external to this module.

Examples

Example usage:

task_output_file = 'foo.txt'
mail.email_files.append(task_output_file)

snsxt.mail.email_output(message_file, *args, **kwargs)[source]¶

Sends an email upon the successful completion of the analysis pipeline. If any email_files were set by the program while running, they will be validated and included as email attachments.

Parameters:	message_file (str) – path to a file to use as the body of the email, typically the program’s log file
	args (list) – a list containing extra args to pass to `email_output()`
	kwargs (dict) – a dictionary containing extra args to pass to `email_output()`

Keyword Arguments:

recipient_list (str) – the recipients for the email, in the format `recipient_list = "user1@server.com,user2@server.com"`
reply_to (str) – email address to use in the ‘Reply To’ field of the email
subject_line (str) – the subject line that should be used for the email

snsxt.mail.sns_start_email(analysis_dir, **kwargs)[source]¶

Emails the user when the sns pipeline starts

Parameters:	analysis_dir (str) – path to a directory to hold the analysis output
	kwargs (dict) – dictionary containing extra args to pass to run_tasks

snsxt.mail.validate_email_files()[source]¶: Makes sure all the items in the email_files list exist and are considered valid for inclusion in email output

Notes

Because the email output is sent by an external program such as mutt, file attachments must be validated before they are included; once the external program has been called, it is more difficult to ensure that the email is sent successfully.
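
As an illustration of checking attachments before handing them to an external mailer, the sketch below keeps only the paths that exist as non-empty regular files. These criteria are an assumption; the documentation does not state what the module actually treats as valid, and `validate_email_files_sketch` is a hypothetical name.

```python
import os


def validate_email_files_sketch(email_files):
    """Return only the paths that look safe to attach."""
    valid = []
    for path in email_files:
        # Assumed criteria: a regular file that exists and is not empty.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            valid.append(path)
    return valid
```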

snsxt.run module¶

Runs a series of analysis tasks

Originally designed as an extension to the sns pipeline output, with the flexibility to add ad hoc extra analysis tasks for downstream processing

snsxt.run.configs = {'analysis_id_file': 'analysis_id.txt', 'tasks_config_dir': 'config', 'report_compile_script': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/snsxt/compile_RMD_report.R', 'GATK_summary_file': 'VCF-GATK-HC-annot.all.txt', 'LoFreq_summary_file': 'VCF-LoFreq-annot.all.txt', 'MuTect2_annot_file': 'VCF-MuTect2-annot.all.txt', 'results_id_file': 'results_id.txt', 'sns_repo_dir': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/snsxt/sns', 'MuTect2_summary_file': 'summary.VCF-MuTect2-annot.csv', 'tasks_files_dir': 'files', 'notification_recipients': '__self__', 'sns_route': 'wes', 'main_report': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/snsxt/report/analysis_report.Rmd', 'snsxt_parent_dir': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest', 'tasks_reports_dir': 'reports', 'success_recipients': '__self__', 'snsxt_dir': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/snsxt', 'report_dir': '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/snsxt/report', 'samples_fastq_raw_file': 'samples.fastq-raw.csv', 'samples_pairs_file': 'samples.pairs.csv', 'reply_to_server': 'nyumc.org', 'extra_handlers': [<logging.FileHandler object>, <logging.FileHandler object>], 'success_subject_line_base': '[NGS580] [Success]', 'error_recipients': '__self__', 'mail_files': ['RunParameters.xml', 'RunParameters.txt', 'summary-combined.wes.csv'], 'GATK_HC_annot_file': 'summary.VCF-GATK-HC-annot.csv', 'notification_subject_line_base': '[NGS580] [Update]', 'tasks_scripts_dir': 'scripts', 'Strelka_annot_file': 'VCF-Strelka-annot.all.txt', 'email_recipients': 'kellys04@nyumc.org', 'summary_combined_file': 'summary-combined.wes.csv', 'Strelka_summary_file': 'summary.VCF-Strelka-annot.csv', 'sns_pairs_route': 'wes-pairs-snv', 'error_subject_line_base': '[NGS580] [Error]', 'tasks_sns_repo_dir': 'sns', 'LoFreq_annot_file': 
'summary.VCF-LoFreq-annot.csv', 'report_files': ['report_tools.R', 'report_config.yml', 'report_styles.css', 'summary_report.Rmd', 'variant_report.Rmd', 'paired_variant_report.Rmd']}¶: The main configurations dictionary to use for settings throughout the program. The sns_repo_dir value is modified at program run time by prepending the snsxt_dir path (path to this script’s directory). Other dict keys are set at program run time as well, including snsxt_parent_dir, snsxt_dir, and extra_handlers.

snsxt.run.default_probes = '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/probes.bed'¶: A .bed formatted file to use by default for CNV analysis. Must have only 3 tab-delimited columns.
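
The three-column requirement can be checked before a file is used. The helper below is a hypothetical sketch and not part of snsxt. It assumes blank lines are ignored and treats any other line, including BED header lines, as a data row.

```python
def has_three_bed_columns(bed_path):
    """Return True if every non-blank line has exactly 3 tab-delimited fields."""
    with open(bed_path) as bed_file:
        for line in bed_file:
            line = line.rstrip("\r\n")
            if not line:
                continue  # assumption: blank lines are ignored
            if len(line.split("\t")) != 3:
                return False
    return True
```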

snsxt.run.default_targets = '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/targets.bed'¶: A .bed formatted file to use by default as the target regions for variant calling

snsxt.run.default_task_list = '/home/docs/checkouts/readthedocs.org/user_builds/snsxt/checkouts/latest/task_lists/default.yml'¶: The YAML formatted task list containing analysis tasks to be run by default

snsxt.run.email_logpath()[source]¶

Returns a file handler for the email log file; needed by the logging.yml config file

This generates dynamic output log file paths & names

Returns:	a Python logging FileHandler object configured with a log file path set dynamically at program run time
Return type:	logging.FileHandler

snsxt.run.extra_handlers = [<logging.FileHandler object>, <logging.FileHandler object>]¶: Python logging FileHandler objects to be passed throughout the program, so that all submodules log to the same file(s) set by logpath() and email_logpath()

snsxt.run.get_task_list(task_list_file)[source]¶

Reads the task_list from a YAML formatted file

Parameters:	task_list_file (str) – the path to a YAML formatted file from which to read analysis tasks
Returns:	a dictionary containing the contents of the YAML task_list_file
Return type:	dict

snsxt.run.logpath()[source]¶

Returns a file handler for the main log file; needed by the logging.yml config file

This generates dynamic output log file paths & names

Returns:	a Python logging FileHandler object configured with a log file path set dynamically at program run time
Return type:	logging.FileHandler
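
A function of this kind might look like the sketch below, which builds a timestamped file name and wraps it in a `logging.FileHandler`. The name `logpath_sketch`, the `log_dir` and `name` parameters, and the timestamp format are assumptions made for illustration; the actual naming scheme is not documented here.

```python
import logging
import os
from datetime import datetime


def logpath_sketch(log_dir, name="run"):
    """Build a timestamped log path and wrap it in a FileHandler."""
    timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    log_file = os.path.join(log_dir, "{0}.{1}.log".format(name, timestamp))
    # delay=True postpones opening the file until the first record is emitted.
    return logging.FileHandler(log_file, delay=True)
```

Because the path is computed when the function is called, each program run gets its own log file without any change to the static logging configuration.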

snsxt.run.main(**kwargs)[source]¶

Main control function for the program

Parameters:

kwargs (dict) – dictionary containing args to run the program, expected to be passed from parse() and passed on to run_sns_tasks()

Keyword Arguments:

analysis_id (str) – an identifier for the analysis (e.g. the NextSeq run ID)
results_id (str) – a sub-identifier for the analysis (e.g. a timestamp)
task_list_file (str) – the path to a YAML formatted file containing analysis tasks to be run
debug_mode (bool) – prevents the program from halting if errors are found in qsub log output files; defaults to False. True = do not stop for qsub log errors, False = stop if errors are found
fastq_dirs (list) – a list of paths to directories to use as input data locations for a new sns analysis. These directories should contain .fastq.gz files within two levels from the top level of the dir (e.g. at most 2 subdirs deep). The .fastq.gz files contained in these directories should keep the exact filenames output by the NextSeq; sample parsing will take place automatically.
targets_bed (str) – path to a .bed formatted file to use as the target regions for variant calling
probes_bed (str) – path to a .bed formatted file to use as the probes for CNV analysis
pairs_sheet (str) – path to a .csv samplesheet to use for matching tumor and normal samples in the paired variant calling analysis steps. See GitHub for example.
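
The depth limit described for fastq_dirs can be illustrated with a bounded directory walk. This sketch is hypothetical (`find_fastq_files` is not an snsxt function) and it interprets "two levels" as files located in the top directory or in at most two nested subdirectories, which is an assumption about the intended meaning.

```python
import os


def find_fastq_files(fastq_dir, max_depth=2):
    """Return sorted .fastq.gz paths at most `max_depth` subdirectories deep."""
    top = os.path.abspath(fastq_dir)
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # do not descend any further
        for filename in filenames:
            if filename.endswith(".fastq.gz"):
                found.append(os.path.join(dirpath, filename))
    return sorted(found)
```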

snsxt.run.parse()[source]¶

Parses the CLI arguments used to run the program. Argument parsing happens here if the program was run as a script

Returns:	a dictionary of keyword arguments to pass to main()
Return type:	dict

Examples

Example script usage:

snsxt$ snsxt/run.py -d mini_analysis-controls/ -f mini_analysis-controls/fastq/ -a mini_analysis -r results1 -t task_lists/dev.yml --pairs_sheet mini_analysis-controls/samples.pairs.csv_usethis
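
A parser that accepts the flags shown in the example command could be sketched with `argparse` as below. Only the flag spellings are taken from the example; the mapping of each short flag to a keyword name (in particular `-d` to `analysis_dir`) is a guess, and `parse_sketch` is a hypothetical name.

```python
import argparse


def parse_sketch(argv=None):
    """Parse CLI arguments into a dict of keyword arguments."""
    parser = argparse.ArgumentParser(description="Run analysis tasks")
    # The flags appear in the example command; the dest names are guesses.
    parser.add_argument("-d", dest="analysis_dir")
    parser.add_argument("-f", dest="fastq_dirs", nargs="+")
    parser.add_argument("-a", dest="analysis_id")
    parser.add_argument("-r", dest="results_id")
    parser.add_argument("-t", dest="task_list_file")
    parser.add_argument("--pairs_sheet", dest="pairs_sheet")
    return vars(parser.parse_args(argv))
```

Converting the parsed namespace with `vars()` yields a plain dictionary, which can then be unpacked into a function that takes `**kwargs`.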

snsxt.run.startup()[source]¶: Configures global attributes of other modules, and performs other actions, when the program starts up

snsxt.setup_report module¶

Sets up and compiles the parent analysis report for the pipeline output

snsxt.setup_report.compile_RMD_report(input_file)[source]¶

Compiles a .Rmd format report using the R script set in the configs.

Returns:	the `tools.SubprocessCmd` object for the shell command that was run to execute the report compilation script
Return type:	SubprocessCmd

snsxt.setup_report.get_main_report_file()[source]¶

Gets the path to the main parent report .Rmd file which should be used to compile the analysis report.

Returns:	the path to the parent .Rmd file to use in compiling the report
Return type:	str

snsxt.setup_report.get_report_files()[source]¶

Gets the supporting files for the parent analysis report based on the configs. These include files with helper functions, sub-reports, etc.

Returns:	a list of paths to files that should be used to set up the parent analysis report
Return type:	list

snsxt.setup_report.setup_report(output_dir, analysis_id=None, results_id=None)[source]¶: Sets up the main analysis report in the analysis directory by copying every associated report file to the output dir
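
The copy step can be pictured with a short sketch. The function below is hypothetical; it assumes the report files are copied flat into the output directory under their original base names, and it ignores the analysis_id and results_id arguments, whose handling is not described here.

```python
import os
import shutil


def setup_report_sketch(report_files, output_dir):
    """Copy each report file into output_dir and return the new paths."""
    os.makedirs(output_dir, exist_ok=True)
    copied = []
    for path in report_files:
        destination = os.path.join(output_dir, os.path.basename(path))
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied
```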

snsxt.test module¶

Runs all the unit tests found throughout the program

snsxt.validation module¶

Functions for validating aspects of the pipeline

snsxt.validation.background_output_files = []¶: By default, a task will validate its expected output files upon task completion. However, tasks that submit qsub jobs and do not wait for them to complete will not be able to validate their expected output files. Instead, the paths to those expected files will be collected in this list, and they will be evaluated once all qsub jobs have been monitored to completion and validated.

snsxt.validation.validate_background_output_files()[source]¶: Validates the global background_output_files list contents.
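
The deferred-validation idea can be sketched as a module-level registry that tasks append to and that is checked once at the end. All names below are hypothetical, and checking only for existence is an assumption about what validation involves.

```python
import os

# Hypothetical module-level registry, mirroring the role of background_output_files.
pending_output_files = []


def register_expected_output(path):
    """Record a file that a still-running background job is expected to create."""
    pending_output_files.append(path)


def validate_pending_output_files():
    """Return the registered paths that are still missing."""
    return [path for path in pending_output_files if not os.path.exists(path)]
```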

snsxt.validation.validate_items(items)[source]¶

Runs validations on a list of items

Parameters:	items (list) – a list of file or dir paths to be validated