
Fix hardlink support for archive creation in FilesFromDisk() by tracking inodes. #64

Open
solvingj wants to merge 1 commit into mholt:main from solvingj:jwiltse/fix_hardlink_detect

@solvingj solvingj commented Jan 30, 2026

Hardlink Support Implementation for mholt/archives (archive creation)

Summary

This branch adds proper hardlink detection and preservation for tar archives in the archives library during archive creation. Hardlinks are now correctly identified during file gathering and written to tar archives using tar.TypeLink entries, significantly reducing archive size for packages with many hardlinked files.

Changes Made

1. archives.go

Added hardlink detection to FilesFromDisk():

  • Import syscall for accessing inode information
  • Track inodes using {dev, ino} key to detect hardlinks
  • When a file with Nlink > 1 is encountered:
    • Check if the inode was seen before
    • If yes, set LinkTarget to the first occurrence's path
    • If no, record this path as the first occurrence
  • Only check for symlinks if not already identified as a hardlink

Code additions (~25 lines):

// Track inodes to detect hardlinks
type inodeKey struct {
    dev uint64
    ino uint64
}
inodeMap := make(map[inodeKey]string)

// In the walk loop:
if info.Mode().IsRegular() {
    if stat, ok := info.Sys().(*syscall.Stat_t); ok && stat.Nlink > 1 {
        // Stat_t field types differ across platforms, so convert explicitly.
        ikey := inodeKey{
            dev: uint64(stat.Dev),
            ino: uint64(stat.Ino),
        }
        
        if firstPath, exists := inodeMap[ikey]; exists {
            linkTarget = firstPath  // This is a hardlink
        } else {
            inodeMap[ikey] = nameInArchive  // First occurrence
        }
    }
}

2. tar.go

Modified writeFileToArchive() to handle hardlinks:

  • Distinguish between hardlinks and symlinks
  • For hardlinks (LinkTarget set on regular files):
    • Set hdr.Typeflag = tar.TypeLink
    • Set hdr.Linkname = file.LinkTarget
    • Set hdr.Size = 0 (no content stored)
  • Skip writing file content for TypeLink entries

Code modifications (~20 lines):

// Handle hardlinks (LinkTarget is set for regular files with multiple hard links)
if file.LinkTarget != "" && file.Mode().IsRegular() {
    // This is a hardlink, not a symlink
    hdr, err = tar.FileInfoHeader(file, "")
    if err != nil {
        return fmt.Errorf("file %s: creating hardlink header: %w", file.NameInArchive, err)
    }
    hdr.Typeflag = tar.TypeLink
    hdr.Linkname = file.LinkTarget
    hdr.Size = 0 // hardlinks don't store content
} else {
    // Regular file, directory, or symlink
    hdr, err = tar.FileInfoHeader(file, file.LinkTarget)
    ...
}

3. hardlink_test.go (new file)

Added comprehensive test coverage:

  • TestHardlinkDetection: Verifies hardlink detection in FilesFromDisk()
  • TestHardlinkInTarArchive: Verifies correct tar headers (TypeLink, zero size, Linkname)
  • TestHardlinkExtraction: Verifies hardlink metadata is preserved during extraction

Results

Tested with Git 2.50.1 for linux/amd64 (a package with 150 hardlinks):

Metric               Before   After    Improvement
Archive size         1.3GB    59MB     95% reduction
Extracted size       3GB      154MB    95% reduction
Hardlinks preserved  0        150      Fixed

Extraction Support

The extraction side already properly handles hardlinks:

  • Extract() sets FileInfo.LinkTarget from tar.Header.Linkname
  • Consumers can check file.Header.(*tar.Header).Typeflag == tar.TypeLink
  • Or check LinkTarget != "" on an entry; symlinks also set LinkTarget, so pair this with the Typeflag check to identify hardlinks

Platform Support

  • Unix-like systems (Linux, macOS): Full support using syscall.Stat_t
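  • Windows: Not supported yet; syscall.Stat_t is not defined on Windows, so the current code does not compile there
    • Could move the detection into a file guarded by a build constraint: //go:build !windows
    • And add a Windows counterpart that skips detection or implements it separately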

Backward Compatibility

  • ✅ No breaking API changes
  • ✅ All existing tests pass
  • ✅ Hardlink support is automatic (no configuration needed)
  • ✅ Archives without hardlinks work exactly as before

Testing

Run tests:

go test -v -run TestHardlink
go test ./...

All tests pass, including:

  • New hardlink tests
  • All existing archive format tests
  • fs.FS integration tests

Future Considerations

  1. Windows support: Add platform-specific hardlink detection
  2. FilesFromFS: Add similar hardlink detection for fs.FS sources
  3. Other formats: Consider hardlink support for zip (if format supports it)
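  4. Performance: Inode tracking adds minimal overhead: one expected O(1) map lookup per regular file with Nlink > 1 (O(n) overall), plus one map entry per distinct multiply-linked inode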

Related Issues

This addresses the same issue as mholt/archiver PR #171 from 6 years ago, but implemented for the newer archives library.

@solvingj
Author

Analysis: Hardlink Support in mholt/archiver vs mholt/archives

Investigation Summary

What Happened to gnu-hardlinks.tar?

The test file testdata/gnu-hardlinks.tar was:

  • Added: In commit 9ea51e7 (Nov 2019) as part of PR #171
  • Removed: In commit 10c5080 (Jan 2022) during the v4 rewrite

The Original Hardlink Fix (archiver v3)

Commit: 9ea51e7370bcb0ca5aa6318b481c2fd51697db9f (Nov 14, 2019)

Problem it solved: When extracting tar archives with hardlinks AND using path stripping (extracting a subdirectory), the Linkname field needed to be adjusted.

The fix (in old tar.go):

// relativize any hardlink names
if th.Typeflag == tar.TypeLink {
    th.Linkname = filepath.Join(filepath.Base(filepath.Dir(th.Linkname)), filepath.Base(th.Linkname))
}

Test case (gnu-hardlinks.tar):

dir-1/
dir-1/dir-2/
dir-1/dir-2/file-a         (regular file, 4 bytes)
dir-1/dir-2/file-b         (hardlink to dir-1/dir-2/file-a)

The test extracted dir-1/dir-2 and expected both files to still be hardlinked.

Why Was It Removed in v4?

The v4 rewrite (10c5080, Jan 2022) completely redesigned the API:

Old API (v3):

  • Archive(sources []string, destination string) - file path based
  • Unarchive(source, destination string) - unpacks to filesystem
  • Extract(source, target, destination string) - extracts specific path with path manipulation

New API (v4):

  • Archive(ctx, output io.Writer, files []FileInfo) - streaming, generic
  • Extract(ctx, sourceArchive io.Reader, handleFile FileHandler) - streaming callback

Key difference: The v4 API is streaming and callback-based. It doesn't:

  1. ❌ Extract directly to filesystem (that's the consumer's job)
  2. ❌ Manipulate paths during extraction (consumer handles that)
  3. ❌ Need the "relativize hardlink names" fix (consumer sees raw archive data)

Does v4 Support Hardlinks?

YES - But differently:

Extraction (Current State):

// From tar.go line 217
file := FileInfo{
    FileInfo:      info,
    Header:        hdr,
    NameInArchive: hdr.Name,
    LinkTarget:    hdr.Linkname,  // ✅ Preserved!
    Open: func() (fs.File, error) {
        return fileInArchive{io.NopCloser(tr), info}, nil
    },
}

The LinkTarget field IS populated from hdr.Linkname, so hardlink information is preserved during extraction.

Creation (Current State):

// From tar.go line 82
hdr, err := tar.FileInfoHeader(file, file.LinkTarget)

The library DOES use file.LinkTarget when creating headers, but...

THE PROBLEM: There's no code to detect hardlinks during file gathering!

What's Missing in v4

In archiver.go (FilesFromDisk):

  • ❌ No inode tracking
  • ❌ No hardlink detection (Nlink > 1 check)
  • LinkTarget is only set for symlinks

This is the same issue we fix in this fork.

Why Was the Test Removed?

Looking at the v4 rewrite commit, it was a massive refactoring (411 line changes to README alone). The test was removed because:

  1. API changed: The old path-based Extract(source, target, destination) method no longer exists
  2. Philosophy changed: Library doesn't extract to filesystem anymore
  3. Not a regression: The old fix was for a specific extraction scenario that the new API doesn't handle

The test wasn't removed because hardlinks became unsupported - it was removed because the specific use case (extracting with path manipulation) no longer exists in the API.

Comparison with Our Fix

Our Implementation (archives fork):

Detection: Track inodes in FilesFromDisk()
Creation: Write TypeLink headers with Linkname
Extraction: Already worked (LinkTarget preserved)
Tests: Comprehensive coverage

archiver v4:

Extraction: LinkTarget preserved
Detection: No hardlink detection
Creation: Can write hardlinks IF LinkTarget is set (but it never is)
Tests: No hardlink tests

Owner

@mholt mholt left a comment

Thank you, we'll give it a shot.

Owner

mholt commented Feb 11, 2026

Will need to fix Windows compilation though.
