{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Scalable Job Tracking\n", "\n", "Launch this tutorial in a Jupyter Notebook on Binder: \n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/htcondor/htcondor-python-bindings-tutorials/master?urlpath=lab/tree/Scalable-Job-Tracking.ipynb)\n", "\n", "The Python bindings provide two scalable mechanisms for tracking jobs:\n", "\n", "* **Poll-based tracking**: The Schedd can be periodically polled\n", " through the use of `Schedd.query` to get job\n", " status information.\n", "* **Event-based tracking**: Using the job's *user log*, Python can\n", " see all job events and keep an in-memory representation of the\n", " job status.\n", "\n", "Both poll- and event-based tracking have their strengths and weaknesses; the\n", "intrepid user can even combine both methodologies to have extremely reliable,\n", "low-latency job status tracking.\n", "\n", "In this module, we outline the important design considerations behind each\n", "approach and walk through examples.\n", "\n", "## Poll-based Tracking\n", "\n", "Poll-based tracking involves periodically querying the schedd(s) for jobs of interest.\n", "We have covered the technical aspects of querying the Schedd in prior tutorials.\n", "Besides the technical means of polling, important aspects to consider are *how often*\n", "the poll should be performed and *how much* data should be retrieved.\n", "\n", "**Note**: When `Schedd.query` is used, the query will cause the schedd to fork\n", "up to ``SCHEDD_QUERY_WORKERS`` simultaneous workers. Beyond that point, queries will\n", "be handled in a non-blocking manner inside the main ``condor_schedd`` process. Thus, the\n", "memory used by many concurrent queries can be reduced by decreasing ``SCHEDD_QUERY_WORKERS``.\n", "\n", "A job tracking system should not query the Schedd more than once a minute. 
Aim to minimize the\n", "data returned from the query through the use of a projection; minimize the number of jobs returned\n", "by using a query constraint. Better yet, use the ``AutoCluster`` flag to have `Schedd.query`\n", "return a list of job summaries instead of individual jobs.\n", "\n", "Advantages:\n", "\n", "* A single entity can poll all ``condor_schedd`` instances in a pool; using `htcondor.poll`,\n", " multiple Schedds can be queried simultaneously.\n", "* The tracking is resilient to bugs or crashes. All tracked state is replaced at the next polling\n", " cycle.\n", "\n", "Disadvantages:\n", "\n", "* The amount of work to do is a function of the number of jobs in the schedd; may scale poorly\n", " once more than 100,000 simultaneous jobs are tracked.\n", "* Individual job state transitions are not seen; only point-in-time snapshots of the queue.\n", "* If a job disappears from the Schedd, it may be difficult to determine why (Did it finish? Was\n", " it removed?)\n", "* Only useful for tracking jobs at minute-level granularity.\n", "\n", "\n", "## Event-based Tracking\n", "\n", "Each job in the Schedd can specify the ``UserLog`` attribute; the Schedd will atomically append a\n", "machine-parseable event to the specified file for every state transition the job goes through.\n", "By keeping track of the events in the logs, we can build an in-memory representation of the job\n", "queue state.\n", "\n", "Advantages:\n", "\n", "* No interaction with the ``condor_schedd`` process is needed to read the event logs; the job\n", " tracking effectively places no burden on the Schedd.\n", "* In most cases, the Schedd writes to the log synchronously after the event occurs. 
Hence, the\n", " latency of receiving an update can be sub-second.\n", "* The job tracking scales as a function of the event rate, not the total number of jobs.\n", "* Each job state is seen, even after the job has left the queue.\n", "\n", "Disadvantages:\n", "\n", "* Only the local ``condor_schedd`` can be tracked; there is no mechanism to receive the event\n", " log remotely.\n", "* Log files must be processed from the beginning, with no rotations or truncations possible.\n", " Large files can take a large amount of CPU time to process.\n", "* If every job writes to a separate log file, the job tracking software may have to keep an\n", " enormous number of open file descriptors. If every job writes to the same log file, the\n", " log file may grow to many gigabytes.\n", "* If the job tracking software misses an event (or an unknown bug causes the ``condor_schedd``\n", " to fail to write the event), then the job tracker may incorrectly believe a job is stuck\n", " in the wrong state.\n", "\n", "At a technical level, event tracking is implemented with the\n", "[htcondor.JobEventLog](https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/htcondor.html#htcondor.JobEventLog) class.\n", "\n", "```\n", ">>> import htcondor\n", ">>> jel = htcondor.JobEventLog(\"/tmp/job_one.log\")\n", ">>> for event in jel.events(stop_after=0):\n", "...     print(event)\n", "```\n", "\n", "The return value of `JobEventLog.events()` is an iterator over\n", "[htcondor.JobEvent](https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/htcondor.html#htcondor.JobEvent)\n", "objects. Because `stop_after=0`, the example above does not block waiting for new events." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }