Pkg + BinaryProvider #841

Closed
staticfloat opened this issue Oct 18, 2018 · 80 comments · Fixed by #1277

Comments

@staticfloat
Member

staticfloat commented Oct 18, 2018

The ponderous forms of Pkg and BinaryProvider slowly intermesh; Hulking behemoths merging their forms like waves from two separate oceans breaking upon the same shore. The silhouette of one blends seamlessly into the shadow of the other, a möbius strip of darkness and light, beginning and ending within itself.

Let's talk about the possible merging of BinaryProvider and Pkg, to integrate the binary installation story to unheard-of levels. Whereas:

  • Binary installation for us is now as simple as unpacking a tarball
  • Pkg knows how to unpack tarballs

I suggest that we do away with the weird indirection we currently have with packages using build.jl files to download tarballs, and instead integrate these downloads into Pkg completely. This implies that we:

  • Create a new concept within Pkg, that of a Binary Artifact. The main difference between a Binary Artifact and a Package is that Packages are platform-independent, whereas Binary Artifacts are necessarily not. We would need to port over the same kind of platform-matching code as is in BP right now, e.g. dynamically choosing the most specific matching tarball based on the currently running Julia. (See choose_download() within BP for more; a rough sketch of this selection logic follows this list.)

  • Modify BinaryBuilder output to generate Binary Artifacts that are then directly imported into the General Registry. The Binary Artifacts contain within them a small amount of Julia code; things like setting environment variables, mappings from LibraryProduct to actual .so file, functions to run an ExecutableProduct, etc... This is all auto-generated by BinaryBuilder.

  • Change client packages to simply declare a dependency upon these Binary Artifacts when they require a library. E.g. FLAC.jl would declare a dependency upon FLAC_jll, which itself declares a dependency upon Ogg_jll, and so on and so forth.

  • Eliminate the Pkg.build() step for these packages, as the build will be completed by the end of the download step. (We can actually just bake the deps.jl file into the Binary Artifact, as we are using relative paths anyway)
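For concreteness, here is a rough sketch of the kind of platform-matching selection referenced in the first bullet. This is a simplification: BinaryProvider's real choose_download() and its ABI-aware triplet handling are more involved, and every name and URL below is purely illustrative.

# Illustrative only: pick the tarball whose platform triplet best matches the host.
function select_tarball(tarballs::Dict{String,String}, host_triplet::String)
    # exact match on the ABI-aware triplet, e.g. "x86_64-linux-gnu"
    haskey(tarballs, host_triplet) && return tarballs[host_triplet]
    # otherwise fall back to the most specific key that is a prefix of the host triplet
    candidates = sort([t for t in keys(tarballs) if startswith(host_triplet, t)]; by=length, rev=true)
    return isempty(candidates) ? nothing : tarballs[first(candidates)]
end

tarballs = Dict(
    "x86_64-linux-gnu"      => "https://example.com/LibFoo.v1.0.0.x86_64-linux-gnu.tar.gz",
    "x86_64-apple-darwin14" => "https://example.com/LibFoo.v1.0.0.x86_64-apple-darwin14.tar.gz",
)
select_tarball(tarballs, "x86_64-linux-gnu")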

Please discuss.

@staticfloat
Member Author

Okay, let's get started on the first bullet point of this list; defining a BinaryArtifact type within Pkg. We need to create a new datatype within Pkg that represents not a Julia package, but a BinaryArtifact, which is distinct in the following ways:

  • BinaryArtifacts are chosen not only by version, but also by runtime-reflected properties (CPU architecture, OS, libgfortran version, etc....)
  • Allow packages to list BinaryArtifacts as something they require, complete with version bounds.
  • Provide an interface for BinaryArtifacts to either "export code" or "bundle metadata". Things like "LibFoo.jll exports the abspath location of libfoo.so", or a wrapper function that sets environment variables before invoking Git.jll's bundled git.exe.
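As a strawman for discussion, the datatype might look something like this. This is purely illustrative and not actual Pkg code; all field names are made up.

# Hypothetical sketch of a BinaryArtifact dependency record:
struct BinaryArtifact
    name::String                                   # e.g. "FLAC_jll"
    version::VersionNumber                         # version of the packaged library
    # runtime-reflected selection data: ABI-aware triplet => (url, sha256)
    downloads::Dict{String,Tuple{String,String}}
end

# Packages would then list BinaryArtifacts, with version bounds, alongside their ordinary
# package dependencies, and Pkg would pick the matching download at install time.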

@00vareladavid
Contributor

I guess we can create an AbstractDependency type with PackageSpec and BinaryArtifact as subtypes? Then we replace most current occurrences of PackageSpec with AbstractDependency.

@00vareladavid
Contributor

Is the idea to download a BinaryArtifact and then key into it with runtime information to determine what tarballs should be downloaded? Or is a BinaryArtifact the tarball itself?

@StefanKarpinski
Member

How about just calling it Dependency, since we're not going to have Dependency <: AbstractDependency; we're going to have PackageSpec, BinaryArtifact <: Dependency.

@00vareladavid
Contributor

Ok, and these types of nodes will be mostly indistinguishable until we hit what is currently build_versions, at which point we key into them with runtime information (i.e. choose_download) to determine the exact tarball which needs to be set up. Is that roughly the plan?

@staticfloat
Member Author

Sounds reasonable to me; I'd be happy to discuss this further and nail down more of an implementation plan during the Pkg call tomorrow?

@StefanKarpinski
Member

StefanKarpinski commented Jan 15, 2019

Version constraints are against the version of the library, not the version of the thing that builds the library. But you want to be able to lock down a specific build of a library. But a specific build is completely platform-specific. There are some layers of versioning:

  1. Artifact identity. The exact identity of the binary artifact that was used in a configuration. We want to record this or be able to reconstruct it somehow, but it's too specific.
  2. Build script version. The version of the build scripts that produces that binary artifact. This will typically support multiple different platforms. This is probably what should be in the manifest.
  3. Library version. The version of the library that the build script is building. This is what compatibility constraints should work with.

Is this correct and complete? The artifact identity should be completely determined by some "system properties" tuple that captures all the things that determine which artifact generated by a build script one needs. The end user mostly only needs to care about the library version, which is what determines its API and therefore usage. There might, however, be situations where one needs compatibility constraints on both the library version and the build script version: e.g. an older build was configured in some way that makes the resulting artifact unusable in certain ways.

@StefanKarpinski
Member

Does a given version of a build script always produce just a single version of a given library?

@stevengj
Member

stevengj commented Jan 22, 2019

How would this work with packages that use BinaryProvider but fall back to compiling from source if a working binary is not available (typically for less-popular Linux distros)? e.g. ZMQ or Blosc IIRC. You need some kind of optional-dependency support, it seems, or support for a source “platform”.

@staticfloat
Member Author

Building from source will be supported manually: users can dev a jll package and then copy their .so files into that directory. This is analogous to allowing users to modify their .jl files within a dev'ed Julia package.
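Roughly, that manual workflow might look like this (the directory layout inside the dev'ed jll package is illustrative, not a promise about the final structure):

pkg> develop LibFoo_jll

shell> cp /path/to/my/locally-built/libfoo.so ~/.julia/dev/LibFoo_jll/deps/usr/lib/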

I do not think we should ever build from source automatically. Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

@ararslan
Member

Another example to add to Steven's list is SpecialFunctions, which falls back to BinDeps when a binary isn't available from BinaryProvider. Once upon a time that was used on FreeBSD, before we had FreeBSD support in BinaryProvider, but now I don't know when it's used aside from on demand on CI.

@stevengj
Member

stevengj commented Jan 24, 2019

Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

We needed it on CentOS, for example (JuliaInterop/ZMQ.jl#176), because of JuliaPackaging/BinaryBuilder.jl#230.

There are an awful lot of Unix flavors out there, and it's nice to have a compilation fallback.

@StefanKarpinski
Member

Regardless of the many UNIX variations, the only things you really need are the right executable format and the right libc, which we can pretty much cover at this point.

@stevengj
Member

And the right libstdc++, which is apparently harder to cover.

(This was why I had to enable source builds for ZMQ and Blosc. Are we confident that this is fixed, or are we happy to go back to breaking installs for any package that calls a C++ library?)

@staticfloat
Member Author

I think our libstdc++ problems should be largely solved now that JuliaPackaging/BinaryBuilder.jl#253 has been merged. We now build with GCC 4.8.5 by default, using a libstdc++ version of 3.4.18, so we are guaranteed to work with anything at least that new. I'm not entirely sure it's possible to build Julia with GCC earlier than 4.8 at the moment (the Julia README still says GCC 4.7+, but I'm pretty sure LLVM requires GCC 4.8+), so this seems like a pretty safe bet to me. I would be eager to hear how users are running Julia with a version of libstdc++ older than 3.4.18.

@stevengj
Member

Should JuliaPackaging/BinaryBuilder.jl#230 be closed then?

@staticfloat
Member Author

Yes I think so.

@Petr-Hlavenka

I'm very supportive of managing the binary artifacts with Pkg. I'd just like to point out that the implementation of library loading should be flexible enough to include some strategy for AOT compilation and deployment (to a different computer). An app deployed to a different computer will have to load libraries from different locations, and the hardcoding of paths in deps.jl makes this pretty difficult; see JuliaPackaging/BinaryProvider.jl#140. The best way would be either not to have deps.jl at all, or not to store absolute paths to the libraries.

@StefanKarpinski
Member

Yes, that's the plan: you declare what you need, referring to it by platform-independent identity instead of generating it explicitly and then hardcoding its location, and you let Pkg figure out the best way to get you what you need and tell you where it is.

@staticfloat
Member Author

staticfloat commented Mar 7, 2019

Progress! There is some code behind this post, while other things remain vaporware; the aspiration is to strike up some discussion on whether these are the aesthetics we want.

  • Building a builder repository now results in the tarballs (typically uploaded to a GitHub release like this one) as well as an Artifact.toml. These currently look something like this:
name = "JpegTurbo_jll"
uuid = "7e164b9a-ae9a-5a84-973f-661589e6cf70"
version = "2.0.1"

[artifacts.arm-linux-gnueabihf]
hash = "45674d19e63e562be8a794249825566f004ea194de337de615cb5cab059e9737"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.arm-linux-gnueabihf.tar.gz"

    [artifacts.arm-linux-gnueabihf.products]
    djpeg = "bin/djpeg"
    libjpeg = "lib/libjpeg.so"
    libturbojpeg = "lib/libturbojpeg.so"
    jpegtran = "bin/jpegtran"
    cjpeg = "bin/cjpeg"

[artifacts.i686-w64-mingw32]
hash = "c2911c98f9cadf3afe84224dfc509b9e483a61fd4095ace529f3ae18d2e68858"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.i686-w64-mingw32.tar.gz"

    [artifacts.i686-w64-mingw32.products]
    djpeg = "bin/djpeg.exe"
    libjpeg = "bin/libjpeg-62.dll"
    libturbojpeg = "bin/libturbojpeg.dll"
    jpegtran = "bin/jpegtran.exe"
    cjpeg = "bin/cjpeg.exe"
...
  • My plan is to embed this file into the Registry in the same way that Project.toml files are embedded right now. Artifacts will be analogous to Project.toml files with the following similarities/differences:

    • They will contain Compat.toml, Deps.toml and Versions.toml entries, which will function exactly the same as a normal Registry entry, except that the downstream DAG of Artifacts can only contain other Artifacts; an Artifact cannot depend on a general Julia package, so in that sense the dependency links are restricted somewhat.
    • They will not contain Manifest.toml, Project.toml or Package.toml, only the afore-mentioned Artifact.toml. This is mostly for simplicity, I don't see why we need these, but I am aware that I may not be thinking this through completely.
  • Pkg is now binary platform-aware, by essentially gutting code from BinaryProvider to instead live inside of Pkg. This allows me to ask things like "what is the ABI-aware triplet of the currently-running host?" (you now get that by calling Pkg.triplet(Pkg.platform_abi_key())).

  • When the user expresses a dependency on one of these Artifact objects (e.g. through Pkg.add("LibFoo_jll")) it will get added to the dependency graph as usual, but when being concretized into a URL to be downloaded, an extra step of indirection is applied by reaching into the Artifact.toml's dictionary, finding dict["artifacts"][triplet(platform_abi_key())] and using the embedded entries as the url and hash to download and unpack into a directory somewhere. (A rough sketch of this selection step follows the examples below.)

  • After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

# LibFoo_jll/src/LibFoo_jll.jl
# Autogenerated code, do not modify
module LibFoo_jll
using Libdl

# Chain other dependent jll packages here, as necessary
using LibBar_jll

# This is just the `artifacts` -> platform_key() -> `products` mappings embedded in `Artifact.toml` above
const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

# This is critical, as it allows a dependency that `libfoo.so` has on `libbar.so` to be satisfied.
# It does mean that we pretty much never dlclose() things though.
handles = []
function __init__()
    # Explicitly link in library products so that we can construct a necessary dependency tree
    for lib_product in (libfoo,)
        push!(handles, Libdl.dlopen(lib_product))
    end
end
end

Example Julia package client code:

# LibFoo.jl/src/LibFoo.jl

import LibFoo_jll

function fooify(a, b)
    return ccall((:fooify, LibFoo_jll.libfoo), Cint, (Cint, Cint), a, b)
end
...
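To make the concretization step above concrete, here is a rough sketch of how Pkg might turn a declared LibFoo_jll dependency into a specific tarball. Pkg.triplet and Pkg.platform_abi_key are the calls mentioned earlier in this comment; the TOML parsing (via the TOML standard library) and the field names simply mirror the example Artifact.toml above, and everything else is illustrative.

import Pkg, TOML

# Parse the registry-provided Artifact.toml and pick the entry for the running host.
artifact_toml = TOML.parsefile("Artifact.toml")
key   = Pkg.triplet(Pkg.platform_abi_key())      # e.g. "arm-linux-gnueabihf"
entry = artifact_toml["artifacts"][key]

url      = entry["url"]        # tarball to download
hash     = entry["hash"]       # hash to verify the download against
products = entry["products"]   # e.g. products["libjpeg"] == "lib/libjpeg.so"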

@StefanKarpinski
Member

I like it in general. I'll have to think for a bit about the structure of the artifacts file. There's a consistent compression scheme used by Deps.toml and Compat.toml; we'll want to use the same compression scheme for the artifact data in the registry, which somewhat informs how you want to structure the data in the file as well.

Do you think we'll eventually want to teach ccall about libraries so that we can just write ccall(:libfoo, ...) and have it know to find the LibFoo shared library? That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

@staticfloat
Member Author

That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

I am actively shying away from teaching Pkg/Base too much about dynamic libraries; it's a deep rabbit hole. In this proposal I'm not even baking in platform-specific library searching awareness (e.g. "look for libraries in bin on Windows, lib elsewhere"). I want to keep Pkg as simple as possible.

On the other hand, I would like it if dlopen() was able to tell me, for instance, that trying to use libqt on a Linux system that doesn't have X11 installed already isn't going to work. It would know this because it would try to dlopen("libqt.so") and fail, then inspect the dependency tree and notice that libx11.so was not findable. This is all possible without much new code, but it does mean that we need to bring things like ObjectFile.jl into Base, and that's a lot of code.

It would be nice if we could do things like search for packages that contain libfoo.so. That's actually one advantage to listing everything out in the Artifact.toml within the registry like that.
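A tiny sketch of the dlopen-failure diagnosis described above, without any ObjectFile.jl machinery. The library names and the dependency list are illustrative; a real implementation would read the dependency list out of the object file headers rather than being handed it.

using Libdl

function diagnose_dlopen(lib::AbstractString, known_deps::Vector{String})
    handle = Libdl.dlopen_e(lib)       # returns C_NULL instead of throwing on failure
    handle != C_NULL && return handle
    for dep in known_deps
        if Libdl.dlopen_e(dep) == C_NULL
            @warn "Cannot load $lib: its dependency $dep was not found on this system"
        end
    end
    error("failed to load $lib")
end

# diagnose_dlopen("libqt", ["libX11"])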

@staticfloat
Member Author

There's a consistent compression scheme used by Deps.toml and Compat.toml

I'm not entirely sure what you mean by this, but I will await your instruction. I have no strong opinions over the Artifact.toml organization, except for the vague feeling that I want to make it as small as possible to avoid bloating the registry and making things slow to download/install/parse/search.

@Petr-Hlavenka

Petr-Hlavenka commented Mar 8, 2019

After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

This automatic wrapper generation with a const assigning the absolute path is exactly the thing that prevents AOT with deployment to a different computer. So during AOT, PackageCompiler will need to modify every single artifact wrapper package to get rid of the baked-in absolute path.

If the code is auto-generated, why can't this functionality be part of some function or macro call that would open the handles and generate the const paths on the fly? In that case PackageCompiler could just pre-collect all the artifacts into a "deployment depot" and let dlopen reach for this "configurable" path, or redefine this const-path generator for the AOT build.

And is the constness of the lib path really necessary for efficient ccall?
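For illustration, one relocatable variant of the generated wrapper could fill in the path at load time instead of baking in a const string. This is only a sketch of the idea, not the planned implementation; the artifact_root helper and the environment-variable override are hypothetical.

module LibFoo_jll
using Libdl

const libfoo  = Ref{String}()      # filled in by __init__ rather than baked in at precompile time
const handles = Ptr{Cvoid}[]

# Hypothetical: let a deployment tool override where the unpacked artifact lives.
artifact_root() = get(ENV, "LIBFOO_JLL_ROOT", abspath(joinpath(@__DIR__, "..", "deps", "usr")))

function __init__()
    libfoo[] = joinpath(artifact_root(), "lib", "libfoo.so")
    push!(handles, Libdl.dlopen(libfoo[]))
end

end # module

# ccall still works with a non-const path; the library lookup just happens at run time:
# ccall((:fooify, LibFoo_jll.libfoo[]), Cint, (Cint, Cint), a, b)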

@staticfloat
Member Author

I should have said will result in those package versions being broken if they are add'd

That's what I'm saying is wrong; you're saying "if I delete Foo that Bar depends on, then try to add Bar, it will fail because Foo is missing". That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well; all the installation happens at ] add time, not ] build time (we're explicitly moving away from being able to have mutable state; this means that everything needs to be installed by the time you finish the Pkg.add() operation).

@StefanKarpinski
Member

StefanKarpinski commented Jun 13, 2019

Right, that is what I am saying will break.

Yeah, there's no good reason to support that.

I'm also having trouble coming up with realistic scenarios where you need to clean out packages but not artifacts or vice versa. But the operation proceeds in two fairly separate phases:

  1. Figure out which packages are no longer referenced by any manifests and delete them.
  2. Figure out which artifacts are no longer referenced by any installed packages and delete them.

You can do one or the other independently and not break things, or one then the other, which should be the default and cleans up the most space. (A rough sketch of this two-phase sweep follows.)
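A self-contained sketch of that two-phase sweep; plain collections stand in for whatever bookkeeping Pkg actually keeps, and this is only meant to show the ordering of the two phases.

# installed_pkgs      :: Set of installed package trees
# manifests           :: Dict mapping manifest path => Set of package trees it references
# installed_artifacts :: Set of installed artifact hashes
# pkg_artifacts       :: Dict mapping package tree => Set of artifact hashes it uses
function gc_sweep(installed_pkgs, manifests, installed_artifacts, pkg_artifacts)
    # Phase 1: packages not referenced by any manifest are stale
    referenced_pkgs = reduce(union, values(manifests); init=Set{String}())
    stale_pkgs = setdiff(installed_pkgs, referenced_pkgs)

    # Phase 2: artifacts not referenced by any surviving package are stale
    survivors = setdiff(installed_pkgs, stale_pkgs)
    referenced_artifacts = reduce(union, (get(pkg_artifacts, p, Set{String}()) for p in survivors);
                                  init=Set{String}())
    stale_artifacts = setdiff(installed_artifacts, referenced_artifacts)

    return stale_pkgs, stale_artifacts   # these are what would get deleted
end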

@StefanKarpinski
Member

StefanKarpinski commented Jun 13, 2019

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different depots than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

@oxinabox
Contributor

oxinabox commented Jun 13, 2019

That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well;

Right, ok, I had the picture wrong in my head.
I thought artifacts would not resolve just like packages do (since they don't have UUIDs or versions), but I guess there is indeed nothing stopping that.

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different repos than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

That is nice. The DataDeps way of doing the same is a bit scary and unsafe, and kind of encourages being unsafe (it will probably have to change eventually; I am now super sold on this whole idea of naming things using their SHA); DataDeps just uses the name.
But here the artifacts are identified by SHA. (On further thought, I assume the SHA is available even after unpacking, because it will be used as a folder name?)

Ok cool things are much clearer now.

@StefanKarpinski
Member

(since they don't have UUIDs or versions)

They don't need UUIDs or versions because they're content-addressed. You don't really care if one libfoo is "the same artifact" as a different libfoo—they're either the same data or they aren't.

it allows artifacts to live in different repos than packages do.

Oops, I meant "depots" not "repos".

But here the artifacts are identified by SHA. (On further thought, I assume the SHA is available even after unpacking, because it will be used as a folder name?)

This comment was about keeping metadata about artifacts around after they're installed so that you know what the SHA etc. was. I'm not really sure about how to structure the thing that goes at ~/.julia/artifacts/libfoo/$slug: you want the actual artifact content somewhere but you also want a bit of metadata about it. This is complicated by the possibility that it is sometimes just a single file and sometimes a folder that we've extracted from an archive. @oxinabox, @staticfloat, do you guys have any thoughts about the structure of these? What would the layout be?

@StefanKarpinski
Member

I'm removing the "speculative" label because this is getting pretty concrete at this point. Some updates from Slack discussion:

  • We should identify artifacts by their on-disk content, not the archive hash. After all, the former is the definitive thing that offers no wiggle room, whereas many different archives can produce the same on-disk tree. That means we should have a git-tree-sha1 field in each artifact stanza, much like we do in package manifest stanzas. We may want to think about how artifact stanzas mirror manifest stanzas in other ways as well.

  • As a corollary of the above, you can potentially have different ways of acquiring the same exact artifact—different download URLs, different archive hashes. I previously thought that we should keep metadata about artifacts somewhere with the artifact, but with this design change I'm not so sure. After all, the one true defining characteristic of an artifact is its tree hash and you can always recompute that from it on disk—and if the slug from that hash doesn't match, then you have a corrupted artifact that you shouldn't use anyway.

  • Maybe we want to keep a log of artifact downloads somewhere like ~/.julia/logs/artifact_usage.toml: a record of what package triggered the install of an artifact, whether it was already installed or not, where it would have been downloaded from, etc.

  • We still want to record a SHA256 hash of the downloaded, pre-extraction state of each artifact so that we can verify it before extracting it, but this is no longer how we identify it. (A small verification sketch follows this list.)

  • I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.
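A minimal sketch of that verification order, under the assumptions above. The SHA256 check of the downloaded file is concrete and uses the SHA standard library; the tree-hash check of the extracted content is shown only as a hypothetical helper.

using SHA

# Verify the downloaded archive against the sha256 recorded in the artifact stanza,
# before extracting anything.
function verify_download(tarball_path::AbstractString, expected_sha256::AbstractString)
    actual = bytes2hex(open(sha256, tarball_path))
    actual == expected_sha256 || error("download corrupted: $actual != $expected_sha256")
    return tarball_path
end

# After extraction, the artifact's identity is the git tree hash of its on-disk content:
#   observed = tree_hash(extracted_dir)             # hypothetical helper
#   observed == expected_git_tree_sha1 || error("corrupted artifact; do not use")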

@simonbyrne
Contributor

  • I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.

The advantage of the dict approach is that it is more extensible should additional keys be required in future.

@staticfloat
Member Author

I'm very willing to use a dict-based approach. There's no inherent advantage to the string format other than compactness (and the ability to fit within a filename); living within the Artifact.toml, if we have access to richer data structures we should just use them.

@visr
Contributor

visr commented Jun 14, 2019

Great work on the design. I want to bring up a point about build variants, that I was thinking about. Curious about your thoughts.

If I understand correctly, the LibFoo_jll binary variant that is selected is based on its version and on system properties only. Is there any other way for the user to pick a different build that is not full manual dev mode? Or should they create a separate LibFoo_with_x_enabled_jll, fork LibFoo.jl, and change the Artifact.toml to use LibFoo_with_x_enabled_jll instead? A concrete example is SQLite with the R*tree module enabled, which perhaps does not make sense as a default, but could be requested specifically in the Artifact.toml of a project or package. You'd probably still want to use it through the same julia wrapper package (SQLite.jl), which would need to know that you want to use a different variant of the binary. Similarly, we could make a default GDAL install small with only the most commonly needed formats, but allow a user to explicitly request a large full variant instead (issue ref). Right now I don't see a way to do that other than deving everything and putting all artifacts in manually. Not sure how big of a can of worms this is, though.

@StefanKarpinski
Member

StefanKarpinski commented Jun 14, 2019

So, the latest sketch of the way Artifacts.toml will look:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"
basename = "dataset-A.csv"

  [dataset-A.download]
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"
basename = "nlp-model-1.onnx"

  [[nlp-model-1.download]] # multiple ways to download
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

Some features of this sketch:

  • the git-tree-sha1 is the defining key of each artifact variant—it must be present
    • this is the tree hash of the final extracted artifact as it appears on disk
    • if there is already an artifact in the corresponding location, there is no need to reinstall the artifact
    • multiple different packages can use the same artifact and describe different ways to get it—the only thing that matters is the bits on disk, which is what this is a hash of; as long as those are the same, just use it, it doesn't matter how it got there
  • an optional basename key can be given for an artifact variant
    • if it is absent, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug
    • if it is present, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug/$basename
    • this is intended to handle situations where the name of the artifact file is significant to some consumer, e.g. a reader that expects a CSV file to have the .csv extension
    • this could also be handled by putting the basename part inside of artifact, but there may be cases where we want to download artifacts as-is and therefore cannot control their structure
  • top-level keys in artifact stanzas with multiple variants are variant selectors: os, arch, etc. (a small selection sketch follows this list)
  • each artifact variant has one or more download stanzas which describe a way to get it
    • there can be one or more url values in a download stanza—this is just a shorthand for giving multiple identical download stanzas that only differ by URL since that will be a common case
    • download stanzas have a sha256 entry, which gives the SHA256 hash of the downloaded file; this may be different for different download methods for the same artifact since it may be archived or compressed differently; this hash allows checking download correctness before extracting.
    • download stanzas may have an extract entry which indicates how to extract the actual artifact tree from the download; it can be a string to indicate a single extraction step or an array of strings to indicate a sequence of extraction steps; these can only be selected from a set of known extraction steps, e.g. tar, gz, bz2, xz, zip; by default, no extraction is performed
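For illustration, a consumer of a file like the sketch above might select a variant roughly like this. This uses the TOML standard library; the matching rules are simplified and the helper function is hypothetical.

using TOML

function select_variant(artifacts_toml::AbstractString, name::AbstractString; os::String, arch::String)
    data = TOML.parsefile(artifacts_toml)[name]
    variants = data isa Vector ? data : [data]    # [[name]] gives an array of stanzas, [name] a single table
    for v in variants
        # a stanza with no selector key is treated as matching any platform
        if get(v, "os", os) == os && get(v, "arch", arch) == arch
            return v    # carries "git-tree-sha1" plus one or more download stanzas
        end
    end
    return nothing
end

# select_variant("Artifacts.toml", "libfoo"; os = "Linux", arch = "armv7l")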

@StefanKarpinski
Member

I'm not so sure if the basename bit is necessary or a good idea. Maybe it isn't—it does mean that all users of an artifact must not only agree on the git-tree-sha1 but also the basename, which gives me pause. Maybe this should be a feature of the download instead, e.g. prefix = "dataset-A.csv"?

@oxinabox
Contributor

oxinabox commented Jun 14, 2019

Part of the download seems right.
If it was a tarball containing a CSV with that name, and that was untarballed, then that should be the same as a CSV that was downloaded and then lost its name (because Base.download does not know how to negotiate names, or the webserver was bad) and then had its name put back by postprocessing.
(Probably not prefix though; maybe localfilename?)
It should be mutually exclusive with extract.
So it would be nice to express both extract and the setting of the name as values for a single option.

Edit: Oh, but we might want to allow .csv.gz and have that be extracted to a .csv.
Still, putting this into the realm of postfetch feels right.

@StefanKarpinski
Member

Maybe call it basename but put it in the download section and have it mean that the download will be extracted to ~/.julia/artifacts/$name/$slug/$basename. The thing that's git tree hashed is the entire tree at ~/.julia/artifacts/$name/$slug, which in that situation would be $basename and whatever it contains. Updated sketch with this scheme:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"

  [dataset-A.download]
  basename = "dataset-A.csv"
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"

  [[nlp-model-1.download]] # multiple ways to download
  basename = "nlp-model-1.onnx"
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  basename = "nlp-model-1.onnx"
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

@oxinabox
Contributor

I think we need more thought.

What is so "base" about basename?

It should only matter for things that are not tarballs or zips. I kinda think it shouldn't ever exist for the other cases? Or at least I am not sure what it would do in those cases.

I'd like to understand more how it interacts with

extract = ["tar", "gz"]
vs.
extract = ["gz"] on a csv
vs.
extract = [] on a csv

Are we thinking that tarballs extract to become one folder and we then rename that folder? Or are we thinking that tarballs become a collection of files? I was thinking the latter, but now I think I am wrong?

@StefanKarpinski
Member

StefanKarpinski commented Jun 15, 2019

basename is just the traditional Unix name for the last part of a path. A better scheme for this would be good.

@StefanKarpinski
Member

Idea: basename could be an extraction step, but I'm not sure how to express this. Rough attempt:

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv",
   "https://server2.com/path/to/dataset.csv",
]
extract = { rename = "dataset.csv" }

That's not quite right though since I don't think you can put a dict in an array.

@oxinabox
Contributor

That is what I was saying.

@StefanKarpinski
Member

Only took me four days for the same thing to occur to me 😁

@oxinabox
Contributor

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv.gz",
   "https://server2.com/path/to/dataset.csv.gz",
]
postfetch.extract = ["gz"]
postfetch.rename = "dataset.csv"

With the stern rule that rename always occurs after extract, and that omitting either results in an identity/no-op.

@staticfloat
Member Author

Having thought about this for a bit, I am uncomfortable with the coupling between rename and git-tree-sha1 (if you change rename you're going to need to change git-tree-sha1). I'm also uncomfortable with how rename doesn't make sense when dealing with a .tar.gz, since if you're going to extract a file, you kind of don't care what the .tar.gz file's filename was, and renaming something after extracting doesn't make sense in that case.

I think I would rather have extraction only be an option in the well-defined case, where we have a container (like a .tar.gz) and the file structure is stored within it. This would make extract and basename mutually exclusive: either you're extracting things, or you're downloading a single file. basename will still interact with git-tree-sha1, but I'm willing to forgive that.

For more complex usecases, I think I would rather push this off onto a more advanced Pkg concept, which I have helpfully written up a big "thing" about over here: #1234 (whooo I got a staircase issue number! Lucky day!). Even if that's not something we want in Pkg, I still think restricting the flexibility here is going to help us keep a sane, simple design.

@StefanKarpinski
Member

Making extract and basename mutually exclusive extraction options seems sane to me. Maybe in that case calling the option filename would be more obvious than basename, which felt more applicable to both files and directories; but of course, in the case of a directory, there's no need for an option to control the name.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

@staticfloat
Member Author

Yeah, I like filename better as well.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

I want to make sure that extraction can work everywhere. Right now with .tar.gz we have pretty good support (since we bundle 7zip with Julia); if we allow people to download non-BB-generated things, we may want to widen that to .zip and .tar.bz2 as well (which would also be pretty well supported). Beyond that, there is some desire for .tar.xz just because it compresses pretty well, but the long tail of distro support doesn't have our backs on that one quite yet. We could conceivably ship binaries of tar and xz for all platforms, add them as a lazy Artifact to Pkg itself (stored in a .tar.gz of course, haha), and then we'd be able to do it..... but for now, I argue let's just stick with the small subset of things we already know works.

staticfloat mentioned this issue Aug 1, 2019
bors bot added a commit that referenced this issue Aug 15, 2019
1277: Add Artifacts to Pkg r=StefanKarpinski a=staticfloat

This adds the artifacts subsystem to Pkg, [read this WIP blog post](https://github.com/JuliaLang/www.julialang.org/pull/417/files?short_path=514f74c#diff-514f74c34d50677638b76f65d910ad17) for more details.  Closes #841 and #1234.

This PR still needs:

- [x] A `pkg> gc` hook that looks at the list of projects that we know about, examines which artifacts are bound, and marks all that are unbound.  Unbound artifacts that have been continuously unbound for a certain time period (e.g. one month, or something like that) will be automatically reaped.
- [x] Greater test coverage (even without seeing the codecov report, I am certain of this), especially as related to the installation of platform-specific binaries.
- [x] `Overrides.toml` support for global overrides of artifact locations

Co-authored-by: Elliot Saba <[email protected]>