.. _file_based:

===============================
File-Based Quantity Computers
===============================

The :py:class:`~chemfit.file_based_computer.FileBasedQuantityComputer` runs an external command in a temporary working directory and parses the resulting output files into a quantity dictionary.
This is the standard way to integrate external simulation codes into ChemFit.

A file-based computer is constructed from three pieces:

- a function that builds the command
- a list of expected output files
- a function that parses those files

Minimal example
-----------------

Consider an external script with a command-line interface that does the following:

1. Accept an input :math:`A`
2. Compute :math:`y_i = A (x_i-2)^2` for a predefined range of :math:`x_i \in \left[ x_\text{min}, x_\text{max} \right]`
3. Write the resulting arrays :math:`y_i` and the corresponding :math:`x_i` to a file

.. note::

    The full script can be found in the unit tests.

In this example we will use the :py:class:`~chemfit.file_based_computer.FileBasedQuantityComputer` to determine the prefactor :math:`A`.

First, we need to define how the external command is called.
For maximum flexibility, the command is provided as a function that accepts the parameter dictionary and the temporary working directory.

Each evaluation runs in its own isolated working directory.
All files created by the external command should be written relative to this working directory.
The paths specified in ``output_files`` are interpreted relative to it as well.

.. note::

    The extra arguments, ``script_file`` and ``output_file``, need to be bound, because the computer only accepts a function whose free arguments are the parameters and the working directory.
    In this example we use the :py:meth:`~chemfit.file_based_computer.FileBasedQuantityComputer.with_cmd` utility method to bind them.

.. code-block:: python

    from pathlib import Path

    # Build the command that creates the output file for the given parameters
    def callable_cmd(
        parameters: dict[str, float], workdir: Path, script_file: Path, output_file: Path
    ) -> list[str]:
        # Return the arguments as a list so that paths containing spaces stay intact
        return ["python", str(script_file), str(parameters["prefactor"]), str(output_file)]

Next, we need to define a parser that converts the generated output file(s) into quantities.
For our example, such a parser could look like this:

.. code-block:: python

    from pathlib import Path
    from typing import Any

    import numpy as np

    def my_output_parser(output_files: list[Path]) -> dict[str, Any]:
        """Parse the output files and retrieve the quantities."""
        f = output_files[0]  # this example produces exactly one output file
        data = np.loadtxt(f)
        return {"y": data[:, 0], "x": data[:, 1]}

.. note::

    ``my_output_parser`` *has* to accept a list of output files.
    In this simple example that is not a concern, since there will only ever be one output file.
    The reason for the list is that the :py:class:`~chemfit.file_based_computer.FileBasedQuantityComputer` may specify multiple output files and, in fact, multiple parsers.
    All output files are passed to all parsers, and their outputs are merged.

We will also need the following loss function:

.. code-block:: python

    from collections.abc import Iterable

    def loss_function(quantities: dict[str, Any], ref_y: Iterable[float]) -> float:
        # Sum of squared deviations from the reference values
        y_values = quantities["y"]
        errors = [(y - y_r) ** 2 for y, y_r in zip(y_values, ref_y)]
        return float(np.sum(errors))

Now we're ready to wire everything up:

.. code-block:: python

    # script_file and ref_quantities are defined elsewhere in the example
    ob = (
        FileBasedQuantityComputer(
            output_files=["output.txt"],
            output_parsers=[my_output_parser],
            base_working_directory=".",
            delete_temp_workdirs=True,
        )
        .with_cmd(callable_cmd, script_file=script_file, output_file="output.txt")
        .with_loss(loss_function, ref_y=ref_quantities["y"])
    )

    initial_guess = {"prefactor": 0.01}

    fitter = Fitter(ob, initial_params=initial_guess)
    opt_params = fitter.fit_scipy()

The entire example can be found in the tests.
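
To make the binding of the extra arguments more concrete, the following sketch shows the same idea using :py:func:`functools.partial` from the standard library.
It only illustrates what "binding" means; it is not a statement about how ``with_cmd`` is implemented, and the file names are placeholders.

.. code-block:: python

    from functools import partial
    from pathlib import Path


    def callable_cmd(parameters, workdir, script_file, output_file):
        # Same command builder as above: four arguments, two of them "extra"
        return ["python", str(script_file), str(parameters["prefactor"]), str(output_file)]


    # Bind the extra arguments; the result only needs (parameters, workdir)
    bound_cmd = partial(
        callable_cmd,
        script_file=Path("square.py"),  # placeholder script name
        output_file=Path("output.txt"),
    )

    cmd = bound_cmd({"prefactor": 0.5}, Path("."))
    print(cmd)  # ['python', 'square.py', '0.5', 'output.txt']

After binding, the callable has exactly the two free arguments that the computer expects.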

What happens during evaluation
------------------------------

Each evaluation runs in an isolated working directory.
A single call performs the following steps:

1. create a temporary working directory
2. run ``presubmit_hook`` (if provided)
3. build the command via ``executable_cmd``
4. execute it using :py:func:`subprocess.run`
5. wait until all expected output files exist
6. parse them using ``output_parsers``
7. return the resulting quantity dictionary

The working directory is removed after evaluation unless configured otherwise.

Customization points
--------------------

The behavior is controlled entirely through callables.

Command construction
^^^^^^^^^^^^^^^^^^^^

``executable_cmd`` receives the parameter dictionary and the current working directory and must return a command as a list of strings:

.. code-block:: python

    from pathlib import Path
    from typing import Any

    def executable_cmd(parameters: dict[str, Any], workdir: Path) -> list[str]:
        return ["my_program", "--x", str(parameters["x"])]

This function is called for every evaluation.

Output files
^^^^^^^^^^^^

``output_files`` defines which files must exist before parsing begins.

.. code-block:: python

    output_files = [Path("energy.txt"), Path("forces.txt")]

All paths must be **relative to the working directory**.

Output parsing
^^^^^^^^^^^^^^

``output_parsers`` is a list of callables. Each parser receives the list of output file paths and returns a dictionary of quantities:

.. code-block:: python

    def parse_energy(paths: list[Path]) -> dict[str, Any]:
        # paths follows the order of output_files
        energy = float(paths[0].read_text())
        return {"energy": energy}

    output_parsers = [parse_energy]

Presubmit hook
--------------

If input files need to be written before execution, use ``presubmit_hook``:

.. code-block:: python

    def presubmit_hook(parameters: dict[str, Any], workdir: Path) -> None:
        # Write into the working directory explicitly instead of relying on the current directory
        (workdir / "input.txt").write_text(str(parameters["x"]))

This runs inside the working directory before the command is executed.

Example: generating an input file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A common use of ``presubmit_hook`` is to generate input files from a template.

.. code-block:: python

    from pathlib import Path
    from typing import Any

    def write_input(
        parameters: dict[str, Any],
        workdir: Path,
        *,
        template_path: Path,
        output_name: str,
    ) -> None:
        # Fill the {{A}} placeholder and write the result into the working directory
        template = template_path.read_text()
        content = template.replace("{{A}}", str(parameters["prefactor"]))

        output_path = workdir / output_name
        output_path.write_text(content)

This can then be attached to the computer:

.. code-block:: python

    computer = (
        FileBasedQuantityComputer(
            output_files=[Path("output.txt")],
            output_parsers=[my_output_parser],
        )
        .with_presubmit(
            write_input,
            template_path=Path("template.in"),
            output_name="input.in",
        )
        .with_cmd(callable_cmd, script_file="square.py", output_file="output.txt")
    )

The presubmit hook runs inside the working directory before the command is executed.
This makes it the right place to prepare all input files needed by the external program.

.. hint::

    Using a template engine such as ``Jinja`` to generate input files can be a very powerful option in the ``presubmit_hook``, especially when many files need to be configured or share a common structure.

Important rules
---------------

Output files must be relative
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All paths in ``output_files`` must be relative to the working directory.
Using absolute paths breaks isolation and can lead to incorrect results when running in parallel.

Existence vs completeness
^^^^^^^^^^^^^^^^^^^^^^^^^

An output file is considered ready as soon as it exists.
The framework does not check whether the file is fully written.

If your program writes files incrementally, ensure that files only appear once complete, or use a separate completion flag file.

Scheduler caveat
^^^^^^^^^^^^^^^^

Some scheduler commands (e.g. ``sbatch``) return as soon as the job is submitted, before the computation has finished.
In that case, the command has already returned while output files are still being written, so a file may exist before the job is done.

A common solution is to write a ``done`` file as the last step of the job and include it in ``output_files``.

Debugging and failure handling
------------------------------

Temporary working directories are deleted after successful execution.
For debugging, you can keep the working directory of a crashed evaluation:

.. code-block:: python

    computer = FileBasedQuantityComputer(
        ...,
        keep_temp_workdir_after_crash=True,
    )

To inspect failures, you can also enable dump files:

.. code-block:: python

    computer = FileBasedQuantityComputer(
        ...,
        write_dump_file_after_crash=True,
    )

During execution, useful information is stored in the context, including:

- the working directory
- the executed command
- the output files

Execution options
-----------------

The constructor exposes additional options:

- ``base_working_directory``: where temporary directories are created
- ``wait_timeout``: maximum time to wait for output files
- ``poll_interval``: how often file existence is checked
- ``subprocess_run_args``: arguments passed to ``subprocess.run``
- ``delete_temp_workdirs``: whether to remove directories after success

Subclassing
-----------

In most cases, constructing a :py:class:`~chemfit.file_based_computer.FileBasedQuantityComputer` with callables is sufficient.

Subclassing is useful when the execution flow itself needs to change.
A typical example is adding a scheduler wrapper such as ``srun``:

.. code-block:: python

    class SrunComputer(FileBasedQuantityComputer):
        def build_cmd(self, parameters, ctx):
            # Build the regular command, then prefix it with the scheduler wrapper
            base_cmd = super().build_cmd(parameters, ctx)
            return ["srun", *base_cmd]

This pattern is used when command construction depends on runtime context.

Summary
-------

:py:class:`~chemfit.file_based_computer.FileBasedQuantityComputer` provides a structured way to:

- run external programs
- isolate executions in temporary directories
- collect results as dictionaries

It is the main integration point for external simulation workflows in ChemFit.
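
As a closing illustration of the advice on file completeness and completion flags, the sketch below shows one way an external script could make its output appear only once it is fully written.
It uses only the Python standard library; the helper names ``write_atomically`` and ``finish_run`` and the file names are hypothetical and not part of ChemFit.

.. code-block:: python

    import os
    import tempfile
    from pathlib import Path


    def write_atomically(path: Path, text: str) -> None:
        """Write to a temporary sibling file, then rename it into place."""
        tmp_path = path.with_name(path.name + ".tmp")
        tmp_path.write_text(text)
        # The rename makes the final name appear only after the content is complete
        # (atomic on POSIX when both paths are on the same filesystem)
        os.replace(tmp_path, path)


    def finish_run(workdir: Path, results: str) -> None:
        """Write the results first and the completion flag last."""
        write_atomically(workdir / "output.txt", results)
        (workdir / "done").touch()


    # Demonstration in a throwaway directory standing in for the working directory
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        finish_run(workdir, "1.0 2.0\n")
        created = sorted(p.name for p in workdir.iterdir())

    print(created)  # ['done', 'output.txt']

With such a script, listing both ``output.txt`` and ``done`` in ``output_files`` would mean parsing starts only after the flag file has been created.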