sml-git

Pure Standard ML git plumbing — the byte-level object, packfile, index and ref formats that every git implementation shares, with no networking. Loose objects (blob/tree/commit/tag), SHA-1 object ids, zlib (de)compression, packfile decoding with full ofs-delta and ref-delta reconstruction, the dircache (.git/index), and ref parsing (loose refs + packed-refs) — all as pure, deterministic codecs over byte strings.

zlib and SHA-1 come from the vendored sml-deflate (which bundles sml-codec and a string-based Zlib facade). No FFI, no threads, no clock, no network — and deterministic, byte-identical under both MLton and Poly/ML.

A network fetch/push transport (the git smart-HTTP / ssh protocol) is deliberately out of scope: that would be a separate, quarantined IO tool on top of this pure format layer. Everything here works against on-disk git objects with zero IO of its own.

Status

  • 114 assertions, green on MLton and Poly/ML, with byte-identical output.
  • Validated against real git fixtures committed under test/fixtures/, generated by the system git CLI (generate.sh) with pinned identities/dates so every object id is reproducible:
    • hashObject reproduces git's exact 40-hex oid for every object git wrote — blob, tree, commit, annotated tag (cross-checked with git hash-object / git rev-parse).
    • loose objects inflate → parse → re-hash back to the same oid; encodeLoose round-trips.
    • two real packfiles of the same repo — one with OBJ_OFS_DELTA (from git repack) and one with OBJ_REF_DELTA (from git pack-objects --no-delta-base-offset) — decode to all 10 objects, and every reconstructed object (including a delta-encoded blob and a delta-encoded commit) re-hashes to git's oid.
    • the real .git/index (v2) parses to the same 3 entries git ls-files reports; packed-refs and loose refs parse to the expected values.
  • Basis-library only; vendors sml-deflate (Layout B), so the repo builds standalone.

Install

With smlpkg:

smlpkg add github.com/sjqtentacles/sml-git
smlpkg sync

The library's MLB pulls in the vendored sml-deflate (zlib) + sml-codec (SHA-1, base16), so structure Git and those structures all come into scope.

Quick start

(* bytes are raw strings: one char per byte, 0-255 *)

(* hash an object exactly as git does: SHA-1 over "<type> <len>\0<payload>" *)
val oid = Git.hashObject (Git.Blob "hello\n")
(* "ce013625030ba8dba906f756967f9e9ca394464a" — same as `git hash-object` *)

(* decode a loose object straight off disk (zlib + parse) *)
val obj = Git.decodeLoose loosebytes
(* refutable pattern: raises Bind if obj is not a commit *)
val Git.Commit {tree, parents, author, message, ...} = obj

(* parse a tree payload into (mode, name, oid) entries *)
val entries = Git.parseTree treepayload

(* decode a packfile (.pack + .idx), applying ofs/ref deltas *)
val pack = Git.Pack.parse {pack = packbytes, idx = idxbytes}
val all  = Git.Pack.objects pack            (* (oid, obj) list, deltas applied *)
val one  = Git.Pack.lookup pack oid         (* obj option *)

(* parse the dircache and refs *)
val staged = Git.Index.parse indexbytes     (* {path, id, mode, size} list *)
val target = Git.Ref.parse "ref: refs/heads/main\n"   (* Symbolic "refs/heads/main" *)
val refs   = Git.Ref.parsePacked packedrefsbytes      (* (name, oid) list *)
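
To make the framing in the first comment concrete, here is a minimal Basis-only sketch of the bytes that get hashed. The helper name `frame` is hypothetical and is not part of the library; `Git.serialize` is documented as producing this layout, but its actual implementation may differ.

```sml
(* Hypothetical helper, not the library's code: git's object framing.
   The oid is the SHA-1 of exactly these bytes. *)
fun frame (typ : string, payload : string) : string =
  typ ^ " " ^ Int.toString (size payload) ^ "\000" ^ payload

(* "blob", a space, the decimal length 6, a NUL byte, then the payload *)
val framed = frame ("blob", "hello\n")
val framedLen = size framed   (* 13 bytes in total *)
```

The length is the payload's byte count in decimal, so the same framing applies unchanged to tree, commit and tag payloads.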

Demo

make example runs examples/demo.sml over the real fixture objects — decoding a loose commit, walking its tree, and reconstructing a delta-encoded blob out of a packfile, verifying every oid against git's:

sml-git demo
============
loose object    : 94c43da0c98a3a96c58f00d6e5a06aa70c0dd410
  type          : commit
  tree          : 86fa20b61dbf28683a0ae91e87fcdab4e0854186
  parent        : 0f0b31f0d114015c2b3de56e88c2781c8564b2de
  author        : Fixture Author <author@example.com> 1700000000 +0000
  message       : second commit: edit big.txt and add docs/note.txt

  hashObject    : 94c43da0c98a3a96c58f00d6e5a06aa70c0dd410
  oid matches   : true

tree 86fa20b61dbf28683a0ae91e87fcdab4e0854186 :
  100644 ee16ef7b005794b50717c165021d16044454c00b  big.txt
  40000 ebe2cbff43af78278245c9487c67456f2f3eb4e0  docs
  100644 ce013625030ba8dba906f756967f9e9ca394464a  hello.txt

packfile        : 10 objects (version 2)
  delta object  : f0d872f6993b3b84102595978057dc29962f45a6
  reconstructed : 32499 bytes
  oid matches   : true
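
The tree listing in the demo output is decoded from a binary payload in which each entry is `<mode> <name>\0` followed by the 20 raw oid bytes. The following is a rough Basis-only sketch of that decoding, written for illustration; `toHex`, `indexFrom` and `parseEntries` are hypothetical names, and the library's own `parseTree` may be structured quite differently.

```sml
(* Hypothetical helpers, not the library's code. *)

(* one byte -> two lowercase hex digits *)
fun hexByte (c : char) : string =
  StringCvt.padLeft #"0" 2
    (String.map Char.toLower (Int.fmt StringCvt.HEX (Char.ord c)))

fun toHex (raw : string) : string =
  String.concat (List.map hexByte (String.explode raw))

(* index of the first occurrence of c at or after position i *)
fun indexFrom (s, i, c) =
  if i >= size s then raise Fail "truncated tree entry"
  else if String.sub (s, i) = c then i
  else indexFrom (s, i + 1, c)

fun parseEntries (s : string) : {mode : string, name : string, id : string} list =
  let
    fun go i acc =
      if i >= size s then List.rev acc
      else
        let
          val sp  = indexFrom (s, i, #" ")
          val nul = indexFrom (s, sp + 1, #"\000")
          val _   = if nul + 21 > size s then raise Fail "truncated oid" else ()
          val entry = { mode = String.substring (s, i, sp - i)
                      , name = String.substring (s, sp + 1, nul - sp - 1)
                      , id   = toHex (String.substring (s, nul + 1, 20)) }
        in
          go (nul + 21) (entry :: acc)
        end
  in
    go 0 []
  end

(* a made-up payload with two entries sharing a dummy 20-byte oid *)
val raw20 = String.implode (List.tabulate (20, Char.chr))
val demo  = parseEntries ("100644 hello.txt\000" ^ raw20 ^ "40000 docs\000" ^ raw20)
```

Modes are stored as ASCII octal without a leading zero, which is why the directory entry above reads `40000` rather than `040000`.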

API

type treeEntry = { mode : string, name : string, id : string }   (* id = 40-hex *)
type commit = { tree : string, parents : string list, author : string
              , committer : string, message : string }
type tag    = { object : string, typ : string, tag : string
              , tagger : string, message : string }

datatype obj = Blob of string
             | Tree of treeEntry list
             | Commit of commit
             | Tag of tag

val objectType  : obj -> string                 (* "blob"|"tree"|"commit"|"tag" *)
val serialize   : obj -> string                  (* "<type> <len>\0<payload>" bytes *)
val payload     : obj -> string
val hashObject  : obj -> string                  (* 40-hex SHA-1 oid, = git's *)
val encodeLoose : obj -> string                  (* zlib-compressed loose object *)
val decodeLoose : string -> obj                  (* inflate + parse *)

val parseObject : { typ : string, payload : string } -> obj
val parseTree   : string -> treeEntry list
val parseCommit : string -> commit
val parseTag    : string -> tag

structure Pack : sig
  type pack
  val parse   : { pack : string, idx : string } -> pack
  val objects : pack -> (string * obj) list      (* deltas applied *)
  val lookup  : pack -> string -> obj option
  val count   : pack -> int
  val version : pack -> int
end

structure Index : sig
  type entry = { path : string, id : string, mode : int, size : int }
  val parse   : string -> entry list             (* dircache / .git/index v2 *)
  val version : string -> int
end

structure Ref : sig
  datatype ref = Direct of string | Symbolic of string
  val parse       : string -> ref                (* a loose ref file's contents *)
  val parsePacked : string -> (string * string) list   (* packed-refs *)
end
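
As an illustration of the two shapes a loose ref file can take, the sketch below classifies a file's contents using only the Basis library. It is an assumption-laden stand-in rather than the library's `Ref.parse`: the datatype is named `target` here to keep it visibly separate, and `trimRight` and `parseLoose` are hypothetical names.

```sml
(* Hypothetical stand-in for the Ref datatype, for illustration only. *)
datatype target = Direct of string | Symbolic of string

(* drop trailing whitespace such as the final newline *)
fun trimRight (s : string) : string =
  Substring.string (Substring.dropr Char.isSpace (Substring.full s))

(* "ref: <name>" is a symbolic ref; 40 hex digits is a direct oid *)
fun parseLoose (contents : string) : target =
  let
    val s = trimRight contents
  in
    if String.isPrefix "ref: " s
    then Symbolic (String.extract (s, 5, NONE))
    else if size s = 40 andalso List.all Char.isHexDigit (String.explode s)
    then Direct s
    else raise Fail "malformed ref"
  end

val head   = parseLoose "ref: refs/heads/main\n"
val branch = parseLoose "ce013625030ba8dba906f756967f9e9ca394464a\n"
```

This sketch raises `Fail` on anything else; how the library reports malformed refs is not shown here.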

Conventions & notes

  • Bytes as string. Object payloads, blob contents and SHA-1 outputs are raw byte strings (one char per byte, 0–255), matching the rest of the sjqtentacles crypto/codec family.
  • Oids are 40-hex. Every oid the API accepts or returns is a lowercase 40-character hex string, exactly as git rev-parse prints it; the 20 raw bytes git stores inside tree entries and ref-deltas are converted to/from hex at the library boundary.
  • Pack.parse takes both .pack and .idx. The idx supplies each object's byte offset, which bounds the per-object zlib stream exactly (the streams are stored back-to-back with no length markers, and the bundled inflate verifies a trailing Adler-32). This is also how real git pairs the two files.
  • Deltas. Both ofs-delta (relative base offset) and ref-delta (base oid) are reconstructed by applying the copy/insert instruction stream against the fully-resolved base object.
  • 32-bit-safe. All binary reads stay below 2³¹ so MLton's default 32-bit Int never overflows; packs larger than 2 GiB (64-bit offsets) are rejected rather than silently mis-decoded.
  • Malformed input raises Git.
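
The copy/insert instruction stream mentioned in the Deltas note can be sketched as follows. This is an independent Basis-only illustration of git's delta encoding, not the library's code; `byteAt`, `varint` and `applyDelta` are hypothetical names, and errors are reported with `Fail` purely for the sake of the example.

```sml
(* Hypothetical sketch, not the library's code. *)
fun byteAt (s, i) = Word.fromInt (Char.ord (String.sub (s, i)))

(* little-endian base-128 varint; returns (value, next index) *)
fun varint (s, i) =
  let
    fun go (i, shift, acc) =
      let
        val b    = byteAt (s, i)
        val acc' = Word.orb (acc, Word.<< (Word.andb (b, 0wx7f), shift))
      in
        if Word.andb (b, 0wx80) <> 0w0 then go (i + 1, shift + 0w7, acc')
        else (Word.toInt acc', i + 1)
      end
  in
    go (i, 0w0, 0w0)
  end

fun applyDelta (base : string, delta : string) : string =
  let
    val (baseSize, i0)   = varint (delta, 0)
    val (resultSize, i1) = varint (delta, i0)
    val _ = if baseSize <> size base then raise Fail "base size mismatch" else ()
    (* read the optional little-endian bytes selected by n opcode bits *)
    fun optBytes (opc, firstBit, n, i) =
      let
        fun go (k, i, acc) =
          if k = n then (acc, i)
          else if Word.andb (opc, Word.<< (0w1, Word.fromInt (firstBit + k))) <> 0w0
          then go (k + 1, i + 1,
                   Word.orb (acc, Word.<< (byteAt (delta, i), Word.fromInt (8 * k))))
          else go (k + 1, i, acc)
      in
        go (0, i, 0w0)
      end
    fun loop (i, out) =
      if i >= size delta then String.concat (List.rev out)
      else
        let
          val opc = byteAt (delta, i)
        in
          if Word.andb (opc, 0wx80) <> 0w0 then
            (* copy: bits 0-3 select offset bytes, bits 4-6 select size bytes *)
            let
              val (off, j) = optBytes (opc, 0, 4, i + 1)
              val (len, k) = optBytes (opc, 4, 3, j)
              val len = if len = 0w0 then 0wx10000 else len
            in
              loop (k, String.substring (base, Word.toInt off, Word.toInt len) :: out)
            end
          else if opc = 0w0 then raise Fail "reserved delta opcode"
          else
            (* insert: the opcode itself is the literal byte count *)
            let val n = Word.toInt opc
            in loop (i + 1 + n, String.substring (delta, i + 1, n) :: out) end
        end
    val result = loop (i1, [])
  in
    if size result = resultSize then result else raise Fail "result size mismatch"
  end

(* sizes 11 and 11, copy 6 bytes from offset 0, then insert "there" *)
val patched = applyDelta ("hello world", "\011\011\144\006\005there")
(* patched = "hello there" *)
```

Note that SML's `\ddd` string escapes are decimal, so `\144` above is the byte 0x90: a copy opcode with only the first size byte present. For an ofs-delta the base is found by offset and for a ref-delta by oid, but once the base bytes are resolved the instruction stream is applied the same way.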

Build & test

make test        # MLton
make test-poly   # Poly/ML (via tools/polybuild)
make all-tests   # both
make example     # build + run examples/demo.sml over the fixtures
make clean

Regenerate the fixtures (requires a system git):

sh test/fixtures/generate.sh

License

MIT — see LICENSE.
