mpitar
======

What is mpitar?
---------------

mpitar is a proof-of-concept implementation of an MPI-parallelized tar command.
It can currently produce tar files using multiple MPI processes that
read and write files in parallel. It currently supports regular files,
symbolic links and directories.

In addition to the tar file, mpitar produces an index file that lists the
contents of the tar file and the offset of each member within it. This index is
included as the last file in the archive itself. A proof-of-concept chop.pl
script is included that takes (a subset of) the lines in the index file and
produces a tar file containing only those files.
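
The idea behind an offset index and chopping can be illustrated with a minimal
Python sketch. It uses only the standard `tarfile` module and does not read or
write mpitar's actual index format, which is not described here. The helper
names `build_archive`, `build_index` and `chop` are hypothetical, and the
`(name, start, end)` tuples are an assumed stand-in for index lines.

```python
import io
import tarfile

BLOCK = 512  # tar archives are organized in 512-byte blocks


def build_archive(files):
    """Create a pax-format tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(archive):
    """Return [(name, start, end)] byte ranges, one per archive member."""
    index = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        for member in tar:
            # Data is padded up to the next multiple of the block size.
            padded = -(-member.size // BLOCK) * BLOCK
            index.append((member.name, member.offset, member.offset_data + padded))
    return index


def chop(archive, entries):
    """Copy the selected byte ranges and append the end-of-archive marker."""
    body = b"".join(archive[start:end] for _, start, end in entries)
    return body + bytes(2 * BLOCK)  # two zero blocks terminate a tar file


archive = build_archive(
    {"a.txt": b"alpha", "b.txt": b"beta" * 200, "c.txt": b"gamma"}
)
index = build_index(archive)
chopped = chop(archive, index[::2])  # keep every second entry

with tarfile.open(fileobj=io.BytesIO(chopped), mode="r:") as tar:
    names = tar.getnames()
    first = tar.extractfile("a.txt").read()
```

Because each member's header and padded data form a contiguous byte range,
selected members can be copied without re-reading the original files.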

Installation
------------

This has only been tested on Linux and macOS systems so far. An MPI
implementation and mpic++ or a similar compiler wrapper are required to
compile. Please set CXX in the Makefile to your compiler or do:

```
make CXX=mpicxx
```

Usage
-----
```
find . -type f -or -type d -or -type l >files.txt
mpirun -n 3 mpitar -c -f feather.tar -T files.txt
```
or
```
mpirun -n 3 mpitar -c -f feather.tar file1 file2 dir2 ...
```

Partial extraction works like so (e.g. to extract every second file):
```
awk 'NR%2' <feather.tar.idx >every_second.idx
choptar.pl every_second.idx feather.tar | tar -t
```
and
```
extractindex.pl feather.tar >feather.tar.idx
```
extracts the index file from the end of the tar file.
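
Since the index is stored as the last file in the archive, it can in principle
be recovered with any tar reader. The following Python sketch shows this with
the standard `tarfile` module. It is not how extractindex.pl works: it walks
all headers from the start rather than reading from the end of the file. The
function name `read_last_member`, the member name `demo.idx` and its one-line
content are placeholders, not mpitar's real index name or format.

```python
import io
import tarfile


def read_last_member(archive):
    """Return (name, content) of the final member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        last = tar.getmembers()[-1]  # scans every header sequentially
        return last.name, tar.extractfile(last).read()


# Build a tiny demo archive whose last member stands in for the index.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    for name, data in [("a.txt", b"alpha"), ("demo.idx", b"a.txt 0\n")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

idx_name, idx_data = read_last_member(buf.getvalue())
```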

Differences to GNU tar
----------------------

* mpitar produces PAX (POSIX) compatible tar files which differ from the GNU
  tar format that GNU tar uses by default.
* mpitar's -T option does not recurse into directories by default, whereas GNU tar does.

Within these limitations, mpitar will attempt to produce files bit-identical to
GNU tar version 1.29 if GNU tar is invoked with options `--no-recursion
--format=pax --record-size=512 --pax-option delete=?time --pax-option
exthdr.name=%d/%f.paxhdr` and the index file is added as the last file to the
archive.

TODO
----

1. ~~write index file listing file name and offset~~
1. ~~sort index file by file name for binary search~~ won't do since this makes chopping hard
1. add parallel file extractor code
1. ~~have mpitar take file and directory names on the command line~~
1. ~~recurse into directories given on the command line~~
1. ~~make sure it works e.g. on OS X (low priority though)~~
1. ~~make error reporting work, do not use assert() for this~~
1. use `MPI_IO` (not sure what the benefit would be)
1. provide some scaling numbers
1. add option to split tar file into X GB smaller tar files
1. make BUFFER size, number of files per work package, work package size etc. runtime options