andrei reported this on 2013-03-09T06:47:47Z
Transferred from https://issues.dlang.org/show_bug.cgi?id=9673
Description
Currently rdmd uses the following process for building:
Fetch the main file (e.g. main.d) from the command line
Compute transitive dependencies for main.d and cache them in a main.deps file in a private directory. This computation is done only when dependency changes render the main.deps file out of date (a sketch of such a freshness check follows this list).
Build an executable passing main.d and all of its dependencies on the same command line to dmd
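As an editorial illustration of the caching step above, here is a minimal sketch of how the freshness check could look. It assumes the cache lists one dependency path per line; rdmd's actual .deps format and location may differ.

```d
import std.algorithm : splitter;
import std.datetime : SysTime;
import std.file : exists, readText, timeLastModified;
import std.string : strip;

// Return true if the cached .deps file is missing or older than the
// root source or any dependency it lists, i.e. the cache is stale.
// Assumes (hypothetically) one dependency path per line in the cache.
bool depsCacheStale(string depsPath, string rootSource)
{
    if (!depsPath.exists)
        return true;

    immutable SysTime cacheTime = depsPath.timeLastModified;
    if (rootSource.timeLastModified > cacheTime)
        return true;

    foreach (line; depsPath.readText.splitter('\n'))
    {
        auto dep = line.strip;
        if (dep.length && dep.exists && dep.timeLastModified > cacheTime)
            return true;
    }
    return false;
}
```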
This setup has a number of advantages and disadvantages. For large projects built of relatively independent parts, an --incremental option should allow a different approach to building:
Fetch the main file (e.g. main.d) from the command line
Compute transitive dependencies for main.d and cache them in a main.deps file in a private directory
For each file discovered this way, compute its own transitive dependencies in a worklist fashion, until the dependencies of all files in the project are computed and cached, one .deps file for each .d file in the project. This computation shall be done only when dependencies change and some .deps files become out of date.
Invoke dmd once per .d file, producing object files (only for object files that are out of date). Invocations should be runnable in parallel, but this may be left as a future enhancement.
Invoke dmd once with all object files to link the code.
The added feature should not interfere with the existing setup. Users should be able to compare and contrast the two approaches simply by adding or removing --incremental on the rdmd command line.
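A minimal sketch, under assumptions, of the worklist traversal and per-file compilation described above. The helper directDeps (returning a file's direct dependencies, e.g. read from its cached .deps file) and the object-file naming are hypothetical, and a full implementation would also recompile a module when any of its dependencies changed, not only when its own source did.

```d
import std.algorithm : canFind;
import std.file : exists, timeLastModified;
import std.path : baseName, setExtension;
import std.process : execute;

// Worklist traversal: starting from the root module, collect every
// file reachable through the dependency relation.
string[] collectProjectFiles(string rootSource, string[] delegate(string) directDeps)
{
    string[] worklist = [rootSource];
    string[] all;
    while (worklist.length)
    {
        auto file = worklist[0];
        worklist = worklist[1 .. $];
        if (all.canFind(file))
            continue;
        all ~= file;
        foreach (dep; directDeps(file))
            if (!all.canFind(dep))
                worklist ~= dep;
    }
    return all;
}

// Compile each out-of-date module to its own object file, then link.
// Only a source-vs-object timestamp check is shown here.
void incrementalBuild(string[] sources, string objDir, string exeName)
{
    string[] objects;
    foreach (src; sources)
    {
        auto obj = objDir ~ "/" ~ src.baseName.setExtension("o");
        objects ~= obj;
        if (!obj.exists || src.timeLastModified > obj.timeLastModified)
        {
            auto r = execute(["dmd", "-c", src, "-of" ~ obj]);
            if (r.status != 0)
                throw new Exception(r.output);
        }
    }
    auto link = execute(["dmd", "-of" ~ exeName] ~ objects);
    if (link.status != 0)
        throw new Exception(link.output);
}
```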
Comments
dlang-bugzilla commented on 2013-03-09T21:06:31Z
*** Issue 4686 has been marked as a duplicate of this issue. ***
code commented on 2013-03-11T09:16:31Z
(In reply to comment 0)
Invoke dmd once per .d file, producing object files (only for object files
that are out of date). Invocations should be runnable in parallel, but this may
be left as a future enhancement.
It should cluster the source files by common dependencies so as to avoid the parsing and semantic analysis overhead of the blunt parallel approach. I think a simple k-means clustering would suffice for this; k would be the number of parallel jobs.
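For illustration only, a sketch of what such a k-means pass over dependency sets could look like. Each module is encoded as a 0/1 vector over all project modules (1 = depends on that module); the encoding, random initialisation, and fixed iteration count are assumptions, not part of the proposal.

```d
import std.algorithm : map, minIndex;
import std.random : uniform;

// Assign each module to one of k clusters by running plain k-means
// on its dependency vector.  Returns the cluster index per module.
size_t[] clusterByDeps(const double[][] depVectors, size_t k, size_t iterations = 10)
{
    immutable dims = depVectors[0].length;

    // Start from k randomly chosen modules as initial centroids.
    double[][] centroids;
    foreach (i; 0 .. k)
        centroids ~= depVectors[uniform(0, depVectors.length)].dup;

    auto assignment = new size_t[depVectors.length];
    foreach (iter; 0 .. iterations)
    {
        // Assignment step: each module goes to the nearest centroid.
        foreach (m, vec; depVectors)
            assignment[m] = cast(size_t) centroids.map!(c => sqDist(c, vec)).minIndex;

        // Update step: each centroid becomes the mean of its members.
        foreach (c; 0 .. k)
        {
            auto mean = new double[dims];
            mean[] = 0;
            size_t count;
            foreach (m, vec; depVectors)
                if (assignment[m] == c)
                {
                    mean[] += vec[];
                    ++count;
                }
            if (count)
            {
                mean[] /= cast(double) count;
                centroids[c] = mean;
            }
        }
    }
    return assignment;
}

// Squared Euclidean distance between a centroid and a dependency vector.
double sqDist(const double[] a, const double[] b)
{
    double s = 0;
    foreach (i; 0 .. a.length)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}
```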
dlang-bugzilla commented on 2013-03-11T09:22:20Z
How would it matter? You still need to launch the compiler once per source file, given the current limitations.
code commented on 2013-03-11T09:30:48Z
You save time by invoking "dmd -c" k times, once per cluster.
dlang-bugzilla commented on 2013-03-11T09:35:29Z
Martin, I think you're missing some information. Incremental compilation is currently not reliably possible when more than one file is passed to the compiler at a time. Please check the thread on the newsgroup for more discussion on the topic.
code commented on 2013-03-11T09:52:41Z
(In reply to comment 5)
We should fix Bug 9571 et al. rather than using them as design constraints.
Of course we'll have to do single invocation as a workaround.
All I want to contribute is an idea for how to optimize rebuilds.
dlang-bugzilla commented on 2013-03-11T10:14:49Z
(In reply to comment 6)
(In reply to comment 5)
We should fix Bug 9571 et al.
Issue 9571 describes a problem with compiling files one at a time.
rather than using them as design constraints.
Of course we'll have to do single invocation as a workaround.
Yes.
All I want to contribute is an idea for how to optimize rebuilds.
I think sorting the file list (incl. path) is a crude but simple approximation of your idea, assuming the project follows sensible conventions for package structure.
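As a concrete reading of that suggestion, here is a tiny sketch that sorts the full paths and cuts the result into k roughly equal chunks, so modules from the same package tend to land in the same chunk. The function name and chunking rule are illustrative.

```d
import std.algorithm : sort;

// Crude clustering: path-sorted file list split into k chunks.
string[][] clusterBySortedPath(string[] files, size_t k)
{
    auto sorted = files.dup;
    sorted.sort();

    auto clusters = new string[][](k);
    foreach (i, f; sorted)
        clusters[i * k / sorted.length] ~= f;
    return clusters;
}
```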
andrei commented on 2013-03-11T10:19:42Z
(In reply to comment 2)
(In reply to comment 0)
Invoke dmd once per .d file, producing object files (only for object files
that are out of date). Invocations should be runnable in parallel, but this may
be left as a future enhancement.
It should cluster the source files by common dependencies so to avoid the
parsing and semantic analysis overhead of the blunt parallel approach. I think
a simple k-means clustering would suffice for this, k would be the number of
parallel jobs.
Great idea, although we'd need to amend things. First, the graph is directed (I'm not sure whether k-means clustering is directly applicable to directed graphs; a cursory search suggests it isn't).
Second, for each node we don't have the edges, but instead all paths (that's what dmd -v generates). So we can take advantage of that information. A simple thought is to cluster based on the maximum symmetric difference between module dependency sets, i.e. separately compile modules that have the most mutually disjoint dependency sets.
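For concreteness, a small sketch of that dissimilarity measure over sorted dependency lists (the function name is hypothetical): modules with a large value share few dependencies and are good candidates for separate compiler invocations.

```d
import std.algorithm : setSymmetricDifference;
import std.range : walkLength;

// Dissimilarity of two modules = size of the symmetric difference of
// their dependency sets (as reported by dmd -v), given sorted inputs.
size_t depDistance(const string[] depsA, const string[] depsB)
{
    return setSymmetricDifference(depsA, depsB).walkLength;
}
```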
Anyhow I wouldn't want to get too bogged down into details at this point - first we need to get the appropriate infrastructure off the ground.
code commented on 2013-03-11T11:55:25Z
(In reply to comment 8)
Great idea, although we'd need to amend things. First, the graph is directed
(not sure whether k-means clustering is directly applicable to directed graphs,
a cursory search suggests it isn't).
I hadn't thought about graph clustering.
Second, for each node we don't have the edges, but instead all paths (that's
what dmd -v generates). So we can take advantage of that information. A simple
thought is to cluster based on the maximum symmetric difference between module
dependency sets, i.e. separately compile modules that have the most mutually
disjoint dependency sets.
That's more like what I had in mind. I'd use k-means to minimize the differences between the dependency sets of each module and the module set of their centroids.
Anyhow I wouldn't want to get too bogged down into details at this point -
first we need to get the appropriate infrastructure off the ground.
Right, but I'm happy to experiment with clustering once this is done.
code commented on 2013-07-19T16:31:36Z
Kind of works, but there are not many independent clusters in phobos.
https://gist.github.com/dawgfoto/5747405
A better approach might be to optimize for even cluster sizes, e.g. trying to split 100 KLOC into 4 independent clusters of 25 KLOC each. The line counts here are sources plus imports. Assignment of source files to clusters could then be optimized with simulated annealing or the like.
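A sketch of that balancing idea, assuming per-file line counts (source plus imports) are already known. The cost function only measures size imbalance; a real version would also penalise dependencies shared across clusters, and all names here are illustrative.

```d
import std.math : exp;
import std.random : uniform, uniform01;

// Assign files to k clusters so that total line counts are balanced,
// using a small simulated-annealing loop over single-file moves.
size_t[] balanceClusters(const size_t[] lineCounts, size_t k, size_t steps = 10_000)
{
    auto assignment = new size_t[lineCounts.length];
    foreach (i; 0 .. assignment.length)
        assignment[i] = i % k;                     // naive initial split

    // Cost = spread between the largest and smallest cluster.
    double cost(const size_t[] asg)
    {
        auto totals = new size_t[k];
        foreach (i, c; asg)
            totals[c] += lineCounts[i];
        size_t lo = size_t.max, hi = 0;
        foreach (t; totals)
        {
            if (t < lo) lo = t;
            if (t > hi) hi = t;
        }
        return cast(double)(hi - lo);
    }

    double temperature = cost(assignment) + 1;
    foreach (step; 0 .. steps)
    {
        immutable file = uniform(0, assignment.length);
        immutable oldCluster = assignment[file];
        immutable newCluster = uniform(0, k);

        immutable before = cost(assignment);
        assignment[file] = newCluster;
        immutable delta = cost(assignment) - before;

        // Keep improving moves; keep worsening ones with decaying probability.
        if (delta > 0 && uniform01() >= exp(-delta / temperature))
            assignment[file] = oldCluster;

        temperature *= 0.999;
    }
    return assignment;
}
```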