CIFFOLD 0.5.4 Pre-Release
1 February 2006
by Kostadin Mitev, Georgi Todorov and Herbert J. Bernstein
User's Manual
Copyright © Kostadin Mitev 2005, 2006
Work funded in part by the International Union of Crystallography
under a grant to Dowling College.
- Copyright and Distribution
- Introduction
- Installation
- Using CIFFOLD
- List of Options
- Default Options
- Logical integrity checks
- Terse Formatting
- Non-terse Formatting
- MAP
- Command-line Arguments
- How are files folded/wrapped
- How are files unfolded/unwrapped
- OTHER SOURCES
- Change Log
- Known Bugs
1. Copyright and Distribution
This software is covered by the GNU General Public License.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
2. INTRODUCTION
Until recently, information in Crystallographic Information File (CIF) format was limited to 80 characters per line
and there was no way to represent longer data items and comments faithfully. With the release of CIF version 1.1,
the maximum line size has been increased to 2048 characters and a protocol has been specified for folding and
unfolding text fields and comments that exceed any given maximum line size. The C/C++ program CIFFOLD implements
this line folding/unfolding protocol without loss of the semantic information in the files. This allows new,
long-line CIF 1.1 files to be converted to a form suitable for processing by existing software for 80-character
line CIF 1.0 files and to recover long-line CIF 1.1 files from CIFs produced by CIF 1.0
software. In addition to folding and unfolding, the software performs logical integrity checks and
allows the user to set a variety of options providing control over the tradeoff between faithful and compact representations.
3. INSTALLATION
You must first obtain a copy of the source kit of CIFFOLD, CIFFOLD.tar.gz.
To unpack the file on a UNIX machine type the command
gunzip CIFFOLD.tar.gz
and then the command
tar -xvf CIFFOLD.tar
to extract the files into a subdirectory named CIFFOLD_0.5.4 under the
current directory.
To create the executable run
make
in the CIFFOLD_0.5.4 directory, which will create the executable named
"ciffold".
To run the program interactively simply type the command "./ciffold -g" and hit enter.
4. USING CIFFOLD
To run ciffold's GUI from the UNIX prompt, type
./ciffold -g
in the CIFFOLD directory and you will be shown the startup menu. The menu consists
of several windows that are shown one by one. The top frame of each window contains
the option, while the bottom one contains either the available choices from which
you have to select one or the prompt "Enter:" after which you have to enter
your choice and hit enter. You can select an option by using the up and down
arrow keys to highlight the desired option and then hitting enter.
5. LIST OF OPTIONS:
- ENTER INPUT FILE: This is the first window that you will see after you
run the program. You have to enter the name of the
input file after the prompt "Enter:" and hit enter.
If the file does not exist, an error message will be shown giving you the choice to
either "exit" the program or "continue" by reentering the name of the input file.
- ENTER THE OUTPUT FILE: This is the second window, shown
after you have entered the name of the input file. You
have to enter the name of the output file here. If the name of the output file
coincides with the name of the input file, an error message will be shown
asking you to either "exit" the program or "continue" by reentering the
name of the output file. If the name of the output file you have entered
coincides with a file in the directory the program is being run from,
you will be warned and allowed to either "continue" using the same name or "change"
the name of the output file.
- File version: This option allows you to insert or change the file version given
by the special comment "#\#CIF_filevers", where filevers is the version number of the file.
Using the arrow keys, you can choose from the following options:
a) 1.0 - the file version will be 1.0
b) 1.1 - the file version will be 1.1
c) Do not change - the file version will not be changed, and none will be added if one does not exist.
- Folding (Yes) or Unfolding (No): This option allows you to choose between folding and unfolding.
If "yes" is selected, the given input file will be folded according to the folding/unfolding protocol for
CIFs (see "How are files folded/wrapped" below for a detailed description of the process). If you choose
"no", the input file will be unfolded.
- Minimal Folding (Yes or No): This option allows you to suppress the reformatting of loops.
If no other options are selected, this results in a minimal amount of folding, so that the
output file is organized the same as the input file, except where long lines or text fields
containing long lines are encountered.
- Create a MAP ?: If you have chosen folding, you will be asked whether to create a MAP of the input file. You can
select "yes" to create one or "no" not to (see "MAP" below for further information).
- Terse Folding/Unfolding?: This option allows you to choose between terse (see "Terse formatting" below) and non-terse (see
"Non-terse formatting" below) formatting. To choose terse select "yes", otherwise select "no".
- Terse formatting on loops?: This option allows you to choose between terse and non-terse formatting of loops.
To choose terse select "yes", otherwise select "no". Terse formatting on loops attempts to
format the data items of loops whose number of items (data + tags) is at least a number that you specify,
according to the terse formatting rules (see "Terse formatting" below).
- How many items is a big loop? : This option is displayed after selecting "yes" for the
"Terse formatting on loops?" option. It allows you to specify how many items
(tags + data) a loop must have to be considered large. The input is an integer between 5 and 2^32-1.
For example, if the value entered is 70, then the data items of every loop whose number of
data items plus number of tags is greater than or equal to 70 will be tersely formatted (see "Terse formatting" below).
- Preserve leading blanks ?: Allows you to choose between preserving and disregarding
the leading blanks in the file. To choose preserve select "yes" otherwise select "no".
Leading blanks are the blanks at the beginning of each line. The blanks at the beginning
of a line inside a textfield or folded comment are not considered leading blanks and are preserved regardless of your choice.
- Process the entire file ?: This option allows you to choose between processing the entire file or only portions of the
file (file chunks). To process the entire file select "yes", otherwise select "no".
- Enter chunk pairs or END to continue: This option is displayed if you have selected "no" for
the "Process the entire file ?" option. At the prompt "Enter:" you can enter chunk pairs one at a time, each
of the form n-n, where n is a nonnegative integer in the range 0 to 2^32-1. After you have entered a pair,
hit enter to enter the next one.
The first integer of the pair has to be bigger than the integers in any previously entered pair.
The second integer of the pair has to be bigger than the first integer in the
pair. To finish entering the pairs, type "end" and hit enter.
- Format only comments? : This option allows you to choose to format only the comments
in the file. To format only comments select "yes", otherwise select "no". If you
select "yes", only the comments of the file are folded/unfolded while the rest of the file is not formatted.
- Format everything except comments ?: This option allows you to choose to format only the data of
the input file (without the comments). If you want to format only the data select "yes", otherwise select "no".
If you select "yes", only the data of the input file will be folded/unfolded; the comments will not be formatted.
- Is this a dictionary file ?: This option allows you to specify whether the input file is
a dictionary or not. If it is, select "yes", otherwise select "no". In the current version of CIFFOLD
this option is not fully implemented and does not have any effect on the processing of the file.
- Output the warning messages ?: The warning messages are the ones that warn you about changes made by the program, such as
changing the delimiter of a string (see "Logical integrity checks" below).
- Output the error messages ?: Select "yes" if you want to have the error messages output
as special comments, i.e. ones that start with #_#, at the end of the file. Error messages contain
the type of error that occurred in the logical integrity checks of the file (see "Logical integrity checks" below).
- Read from a MAP?: This option allows you to specify whether the file should be formatted
according to its MAP file (see "MAP" below) or not. To use a MAP file to format it select
"yes", otherwise select "no". The MAP file should be at the end of the input file, with each line
of the MAP file prefixed with the special comment #_M# . If the MAP file does not exist
or does not conform to its specification, then the default options for file unfolding are used
to format the rest of the file (see "Default options" below).
- The column with respect to which the data should be aligned: This option is provided
only when unfolding files. It allows you to left justify the data associated with a tag with respect
to the column that you specify. If you do not want to left justify the data,
either enter "0" and press enter or just press enter. The option is useful when
unfolding a tersely formatted file without using a MAP, and allows the data to be laid out
in a more structured and easier to read way. If a column is specified, the
program will attempt to left justify each data field with respect to the given column; if
this is not possible, at least a single space will be used to separate the tag from
the data to produce a valid CIF file.
- Specify the maximum line length?: This option lets you specify the maximum line
length of the file that will be output, and it appears only if you choose folding. It takes
as input a positive integer between 60 and 2048. As of version 0.3 of CIFFOLD this option
is implemented only when folding files. The maximum line length is forced to 2048 for unfolding.
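The left-justification rule described under "The column with respect to which the data should be aligned" can be illustrated with a short Python sketch. This is not CIFFOLD's own code: the function name align_data is hypothetical, and counting columns from 1 is an assumption, since the manual does not say how columns are numbered.

```python
def align_data(tag: str, data: str, column: int) -> str:
    """Place `data` after `tag`, starting at the 1-based `column` if possible.

    A column of 0 means "do not left justify". When the tag is too long
    for the data to start at the requested column, a single space is used
    so that the tag and the data remain separate tokens.
    """
    if column <= 0:
        return tag + " " + data
    # Number of blanks needed so that data starts at `column` (1-based).
    pad = column - 1 - len(tag)
    if pad < 1:
        pad = 1  # fall back to a single separating space
    return tag + " " * pad + data


if __name__ == "__main__":
    print(align_data("_cell_length_a", "10.5", 20))
    print(align_data("_cell_length_a", "10.5", 5))
```

With a column of 20 the value starts in column 20; with a column of 5 the tag is already longer than the requested column, so a single space separates the two.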
6. DEFAULT OPTIONS
CIFFOLD has default values for the options that have not been selected.
These defaults are also used if something goes wrong during processing of the file.
For example, if the file should be formatted according to a MAP but it does not
contain a MAP, or the MAP becomes invalid at some point, then the default
options will be used and the user will be warned. The program
uses the following default options:
- For folding:
a) Do not change the file version
b) Do not fold the file tersely
c) Do not fold tersely large loops
d) Preserve the leading blanks
e) Process the entire file
f) Do not format only comments
g) Do not format everything except comments
h) The file is not a dictionary file
i) Do not output the warning messages
j) Output the error messages
k) The maximum line length is set to 80
l) Do reorganize loops
- For unfolding:
a) Do not change the file version
b) Do not fold the file tersely
c) Do not fold tersely large loops
d) Preserve the leading blanks
e) Process the entire file
f) Do not format only comments
g) Do not format everything except comments
h) Do not specify a column with respect to which the data should be left justified(0).
i) The file is not a dictionary file
j) Do not output the warning messages
k) Output the error messages
7. LOGICAL INTEGRITY CHECKS
CIFFOLD checks the file for some basic logical integrity errors and generates warnings about them.
The checks performed are:
- is there data corresponding to each tag
- do two tags have the same name within one data block
- do two data block headers have the same name
- are there non-delimited deprecated tags such as global_, start_ and stop_
- is there a non-delimited occurrence of save_ if the file is not a dictionary
- are there nested loops
- is there a line longer than the maximum allowed length (2048)
- is there a delimited data field whose opening delimiter does not match its closing delimiter
- is there a data block with no name
- are there tags longer than 80 characters
- is the number of data items in each loop an exact multiple of the number of tags
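Three of the checks listed above (duplicate tags, tags longer than 80 characters, and loop item counts) can be sketched in Python as follows. This is only an illustration of the checks, not CIFFOLD's implementation: all function names are hypothetical, and comparing tag names without regard to case is an assumption based on CIF data names being case-insensitive.

```python
from collections import Counter

MAX_TAG_LENGTH = 80  # limit quoted in the list of checks


def duplicate_tags(tags):
    """Return the (lowercased) tag names that occur more than once in a data block."""
    counts = Counter(tag.lower() for tag in tags)
    return sorted(name for name, n in counts.items() if n > 1)


def overlong_tags(tags, limit=MAX_TAG_LENGTH):
    """Return the tags that are longer than `limit` characters."""
    return [tag for tag in tags if len(tag) > limit]


def loop_is_balanced(n_tags, n_values):
    """True if the number of data items is an exact multiple of the number of tags."""
    return n_tags > 0 and n_values % n_tags == 0


if __name__ == "__main__":
    print(duplicate_tags(["_atom_site_label", "_cell_length_a", "_Atom_Site_Label"]))
    print(loop_is_balanced(3, 7))
```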
In addition to the logical integrity checks, CIFFOLD will detect and change the
delimiter of a string with the following peculiarity:
the same character as the delimiter appears right after the opening delimiter or
right before the closing delimiter. The delimiter of such a string will be changed to
the alternative one, for example " to ' and vice
versa, so the string "rambo"" will be changed to 'rambo"'. A warning
will be issued about the change, and if the
option "Output the warning messages ?" is selected, it will be output as
a special comment at the end of the output file.
A warning will also be issued if a non-delimited reserved character (such as [, ], _, etc.) is present.
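The delimiter change described above can be written as a small Python function. This is a sketch of the stated rule only: the name fix_ambiguous_delimiter is hypothetical, and the sketch does not consider the case in which the string also contains the alternative quote character, which the manual does not discuss.

```python
def fix_ambiguous_delimiter(token: str) -> str:
    """Swap the quote delimiter of `token` when the same quote character
    appears right after the opening or right before the closing delimiter."""
    alternative = {'"': "'", "'": '"'}
    # Only quoted strings with at least one character of content qualify.
    if len(token) < 3 or token[0] not in alternative or token[-1] != token[0]:
        return token
    delimiter = token[0]
    inner = token[1:-1]
    if inner[0] == delimiter or inner[-1] == delimiter:
        other = alternative[delimiter]
        return other + inner + other
    return token


if __name__ == "__main__":
    print(fix_ambiguous_delimiter('"rambo""'))
    print(fix_ambiguous_delimiter('""rambo"'))
```

Strings whose content does not touch the delimiters with the same quote character are returned unchanged.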
8. TERSE FORMATTING
If the terse option is chosen, the program will attempt to reduce the amount of white
space to a minimum by putting as much information as possible on one line, while the file
remains a valid CIF. This option is considered user unfriendly and is used to reduce the
size and length of the file. If a string is delimited with a single/double quote and
immediately after the opening delimiter there is another single/double quote, or
immediately before the closing delimiter there is a single/double quote, then the
delimiter is changed to its alternative, which is the single quote for the double quote
and vice versa. For example, a string of the type ""rambo" will be
converted to a string of the type '"rambo'. This is done to avoid ambiguity and improve
the clarity of the content of CIF files. Any single hashmark will be put on a new line.
9. NON TERSE FORMATTING
If the non-terse option is selected, the program uses the following rules to format the input file:
- every tag is put on a new line
- the data corresponding to a tag is put on the same line as the tag if it will fit
- every special tag such as loop_, data_, etc. is also put on a new line
- if the data in a loop can be aligned in columns and rows such that
one row holds as many distinct data items as there are tags in the
loop, this is done. If the data cannot be aligned in this way, then
the original formatting is preserved as much as possible.
- if a string is delimited with a single/double quote and immediately
after the opening delimiter there is another single/double
quote, or immediately before the closing delimiter there is a
single/double quote, then the delimiter is changed to its alternative,
which is the single quote for the double quote and vice versa. For example,
a string of the type ""rambo" will be converted to a
string of the type '"rambo'. This is done to avoid ambiguity and improve the
clarity of the content of CIF files.
- if there is a long data field that is not a text field (i.e. not delimited by ";"), it is converted to a text field and folded.
10. MAP
The optional MAP is used to save information on the original positions of items
when a file is folded.
The MAP is a file made up of the tokens "dh" for data, where h is the delimiter of the data
(either ;, ' or ", or nothing if no delimiter is used), "sn" for spaces and "tn" for tabs,
where n is the number of spaces/tabs. For each line of the input file there is a line in
the MAP file that shows the layout of the line. For example, d's7d shows that
there is data delimited by a single quote, followed by 7 spaces and non-delimited data.
The MAP file is then concatenated to the output file such that each line is prefixed by
#_M#, indicating that the line is part of the MAP file. The MAP file is
useful if a file is folded and it is then necessary to reconstruct exactly the same file by unfolding it.
WARNING: As of version 0.3 of CIFFOLD the lines of the MAP may be of arbitrary length. This means
that if there are 60 separate items on a single line of the input file, the corresponding line in the
MAP file will be more than 60 characters long, and if the maximum line length has been set to 60 the MAP
will exceed it.
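To make the MAP notation concrete, the following Python sketch splits one MAP line into its tokens. It covers only the "dh", "sn" and "tn" tokens described in this section; the function name parse_map_line and the tuple layout are hypothetical, and the actual MAP written by CIFFOLD may contain further tokens (the change log mentions an 'n' token for new lines) that are not handled here.

```python
import re

# "d" plus an optional delimiter (; ' or "), or "s"/"t" plus a count.
TOKEN = re.compile(r"""d([;'"]?)|([st])(\d+)""")

MAP_PREFIX = "#_M#"  # special comment that marks MAP lines in the output file


def parse_map_line(line):
    """Split a MAP line such as d's7d into (kind, value) tuples."""
    if line.startswith(MAP_PREFIX):
        line = line[len(MAP_PREFIX):]
    line = line.strip()
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:
            raise ValueError(f"unrecognized MAP token at position {pos}: {line!r}")
        if match.group(2) is None:
            # Data token: the value is the delimiter, or "" for non-delimited data.
            tokens.append(("data", match.group(1)))
        else:
            kind = "space" if match.group(2) == "s" else "tab"
            tokens.append((kind, int(match.group(3))))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(parse_map_line("d's7d"))
```

For the example from the text, d's7d, the sketch yields a single-quoted data item, a run of 7 spaces, and a non-delimited data item.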
11. Command-line Arguments
ciffold [-i input_cif] [-o output_cif] [-x n-n,n-n]
[-l n] [-m n] [-C n] [-p a[w][e]] [-v file_vers]
[-c] [-d] [-e] [-g] [-w [-n]] [-u] [-L] [-t] [-h] [-M] [-V]
If you want to run CIFFOLD with the options specified on the command line,
type "./ciffold" followed by the options and then hit enter.
The options provided are:
[-i input_cif]      corresponds to "ENTER INPUT FILE:" (see above);
                    for command line use, a "-" indicates standard input;
                    input_cif defaults to stdin
[-o output_cif]     corresponds to "ENTER THE OUTPUT FILE:" (see above);
                    for command line use, a "-" indicates standard output;
                    output_cif defaults to stdout
[-d]                corresponds to "Is this a dictionary file ?:" with a
                    value of "yes" (see above)
[-u]                corresponds to "Folding (Yes) or Unfolding (No):"
                    with a value of "no" (see above)
[-w]                corresponds to "Folding (Yes) or Unfolding (No):"
                    with a value of "yes" (see above)
[-n]                corresponds to "Minimal Folding (Yes or No):"
                    with a value of "yes" (see above)
[-m maxline]        corresponds to "Specify the maximum line length?:"
                    (see above). Note: this option is considered
                    only when folding files. When unfolding, the
                    maximum line length is forced to 2048.
[-v file_version]   corresponds to "File version:"; valid
                    file_versions are 1.0 and 1.1 (see above)
[-t]                corresponds to "Terse Folding/Unfolding?:" with a
                    value of "yes" (see above)
[-l integer]        corresponds to "Terse formatting on loops?:" with a
                    value of "yes"; integer corresponds to "How many
                    items is a big loop? :" (see above)
[-L]                corresponds to "Preserve leading blanks ?:" with a
                    value of "yes" (see above)
[-c]                corresponds to "Format only comments? :" with a value
                    of "yes" (see above)
[-e]                corresponds to "Format everything except comments ?:"
                    with a value of "yes" (see above)
[-C integer]        corresponds to "The column with respect to which
                    the data should be aligned:" (see above)
[-p character]      valid values for "character" are:
                    "a" - corresponds to "Output the error messages ?:" with a
                          value of "yes" and "Output the warning messages ?:"
                          with a value of "yes"
                    "w" - corresponds to "Output the warning messages ?:" with
                          a value of "yes"
                    "e" - corresponds to "Output the error messages ?:" with a
                          value of "yes" (see above)
[-g]                takes no values and invokes the GUI interface
[-M]                if folding, corresponds to "Create a MAP ?:" with a
                    value of "yes";
                    if unfolding, corresponds to "Read from a MAP?:"
                    with a value of "yes" (see above)
[-h]                takes no values; prints a help message and exits
[-x n-n,n-n]        corresponds to "Process the entire file ?:" with a value
                    of "no". Each n-n corresponds to an entry for "Enter
                    chunk pairs or END to continue:", where the first
                    n is the starting integer and the second is the
                    ending integer. Example: to format only the chunks
                    9-10 and 40-70, specify -x 9-10,40-70
[-V]                takes no values; prints the current version and exits
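The format of the -x argument, together with the ordering rules given for "Enter chunk pairs or END to continue", can be checked with a short Python sketch such as the one below. It illustrates the documented rules only and is not taken from the CIFFOLD sources; the function name parse_chunks is hypothetical.

```python
MAX_CHUNK_VALUE = 2**32 - 1  # upper bound quoted for chunk integers


def parse_chunks(spec):
    """Parse a chunk list such as "9-10,40-70" into a list of (start, end) pairs.

    Each start must be bigger than every integer in the earlier pairs,
    and each end must be bigger than its own start.
    """
    pairs = []
    last = -1  # largest integer seen so far; -1 allows a first start of 0
    for item in spec.split(","):
        start_text, dash, end_text = item.strip().partition("-")
        if not dash or not start_text.isdigit() or not end_text.isdigit():
            raise ValueError(f"chunk must have the form n-n: {item!r}")
        start, end = int(start_text), int(end_text)
        if end > MAX_CHUNK_VALUE:
            raise ValueError(f"chunk value out of range: {item!r}")
        if start <= last:
            raise ValueError(f"chunk start must exceed earlier values: {item!r}")
        if end <= start:
            raise ValueError(f"chunk end must exceed its start: {item!r}")
        pairs.append((start, end))
        last = end
    return pairs


if __name__ == "__main__":
    print(parse_chunks("9-10,40-70"))
```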
12. HOW ARE FILES FOLDED/WRAPPED
CIFFOLD will make two passes through the file. On the
first pass it will perform logical integrity checks,
issue the appropriate warnings and error messages and
will create a temporary file where the input file will
be stored. It will also create a MAP for the file if
the MAP option is selected and will create a temporary
file for the MAP. Some additional information about
the file is gathered as well. On the second pass
CIFFOLD will actually fold/wrap the file according to
the following rules:
Lines will be folded/wrapped only if they exceed the
maximum line length. Thus, if a text field has lines
that are shorter than the maximum allowed line length, it
will not be folded/wrapped. Strings that are
shorter than the maximum allowed line length but end
beyond the column of the maximum allowed line length
will either be brought back to the left by deleting
blank characters or be placed on a new line if
the former is not possible. The loops will be formatted
according to the following rules:
Every tag is placed on a new line. If possible, the
data tokens in the loop will be aligned into rows and
columns such that each row contains as many data
tokens as there are tags. If such alignment is
not possible, the original formatting will be preserved
as much as possible.
The option to preserve the leading blanks will not
preserve the leading blanks for the tokens that fall
within a loop. Unless the trailing blanks fall within
a text field they will be deleted.
When processing is finished, the temporary files are
deleted.
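As an illustration of folding a text field, the following Python sketch applies the CIF 1.1 line folding convention, in which a folded text field opens with ";\" and every continued line ends with a backslash. It is a simplified sketch and not CIFFOLD's algorithm: the function names are hypothetical, lines are broken after a blank when one is available, and content lines that themselves end with a backslash, begin with a semicolon, or consist of long runs of semicolons are assumed not to occur.

```python
def _fold_line(line, maxlen):
    """Split one content line into pieces of at most `maxlen` characters,
    each piece except the last ending with a backslash."""
    pieces = []
    while len(line) > maxlen:
        cut = maxlen - 1                    # leave room for the backslash
        blank = line.rfind(" ", 1, cut)     # prefer to break just after a blank
        if blank != -1:
            cut = blank + 1
        # Avoid starting a continuation line with ";", which would look
        # like the end of the text field.
        while cut > 1 and line[cut] == ";":
            cut -= 1
        pieces.append(line[:cut] + "\\")
        line = line[cut:]
    pieces.append(line)
    return pieces


def fold_text_field(lines, maxlen=80):
    """Return the physical lines of a text field, folded only if needed."""
    too_long = any(
        len(text) + (1 if i == 0 else 0) > maxlen for i, text in enumerate(lines)
    )
    if not too_long:
        # Nothing exceeds the limit: emit an ordinary, unfolded text field.
        out = list(lines) or [""]
        out[0] = ";" + out[0]
        return out + [";"]
    out = [";\\"]                           # marks the field as folded
    for text in lines:
        out.extend(_fold_line(text, maxlen))
    out.append(";")
    return out


if __name__ == "__main__":
    for physical in fold_text_field(["short line", ("word " * 30).strip()], 60):
        print(physical)
```

Joining each line that ends with a backslash to the line that follows it recovers the original content lines.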
13. HOW ARE FILES UNFOLDED/UNWRAPPED
CIFFOLD will make two passes through the file. On the
first pass it will perform logical integrity checks,
issue the appropriate warnings and error messages and
will create a temporary file where the input file will
be stored. It will also create a temporary MAP file
which will hold the MAP of the input file if it exists
and the MAP option is selected. Some additional
information about the file is gathered as well. On the
second pass CIFFOLD will actually unfold/unwrap the
file according to the following rules (if the default
options are used):
Every tag will be placed on a new line. The data
associated with a tag will be placed on the same line
as the tag if the resulting line length does not
exceed the maximum allowed line length and there is
no more than one newline character between the tag
and the data.
Example:
_a_tag
data
and
_a_tag data
will both be unfolded/unwrapped as:
_a_tag data
but, when more than one newline character separates the tag from the data:
_a_tag

data
will be unfolded/unwrapped with the tag and the data on separate lines:
_a_tag
data
The loops will be formatted according to the following
rules unless the -n option has been selected:
Every tag is placed on a new line. If possible, the
data tokens in the loop will be aligned into rows and
columns such that each row contains as many data
tokens as there are tags. If such alignment is
not possible, the original formatting will be preserved
as much as possible.
Unless the trailing blanks fall within a text field
they will be deleted.
The option to preserve the leading blanks will not
preserve the leading blanks for the tokens that fall
within a loop. The only way the original file can be
exactly recovered is by using the MAP option.
When processing is finished, the temporary files are
deleted.
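The rule for placing a tag and its data when unfolding can be summarized in a few lines of Python. This is a sketch of the rule as stated in this section, not CIFFOLD's code: the function name place_tag_and_data is hypothetical, and how many blank lines are kept between a tag and data that stay on separate lines is not modeled.

```python
UNFOLD_MAXLINE = 2048  # line length that CIFFOLD forces when unfolding


def place_tag_and_data(tag, data, newlines_between, maxlen=UNFOLD_MAXLINE):
    """Return the output lines for one tag and its data item.

    The data joins the tag on one line only if at most one newline
    separated them in the input and the joined line fits in `maxlen`.
    """
    joined = tag + " " + data
    if newlines_between <= 1 and len(joined) <= maxlen:
        return [joined]
    return [tag, data]


if __name__ == "__main__":
    print(place_tag_and_data("_a_tag", "data", 1))
    print(place_tag_and_data("_a_tag", "data", 2))
```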
14. OTHER SOURCES:
For information about CIF files visit: http://www.iucr.org/.
For information about the folding/unfolding protocol for CIFs visit:
http://www.iucr.org/iucr-top/lists/cif-developers/msg00147.html
15. Change Log
- Release 0.5.4, 1 February 2006 KM+HJB
Correct text field blank stripping and unfolding
of text fields to quoted strings.
- Release 0.5.3, 30 September 2005 HJB
Add command line option -n for minimal folding,
suppressing loop reformatting.
- Release 0.5.2, 1 August 2005 HJB
Changed handling of folded quoted strings to end with backslash
in the text field to avoid an extra newline and improved handling
of embedded semicolons.
- Release 0.5.1, 25 July 2005 GT+HJB
Updated output of -h option and redirected that output to cerr.
Updated version number in all source code file headers.
- Release 0.5, 23 July 2005 HJB
Added code to fold comments and text fields on a blank
when available.
Moved temporary files to /tmp
Cleaned up some white space in FUCIF.c
(HJB)
- Release 0.4.5, 11 July 2005 KM, 22 July 2005 HJB
Correction in FUCIF.c to correct infinite loop on some
terminal comments discovered by I. Awuah Asiamah. (KM)
Cleanup of bad characters in README and addition of logo. (HJB)
- Release 0.4.4, 31 May 05 KM
Corrections made in FUCIF.c in outputTextField to properly terminate if
end of file is reached before the closing delimiter of the
textfield. Corrections made in FUCIF.c in formatLoop to insert a new
line before comment that falls in a loop, it is the only item of
the input line and there is a data item on the output line.
- Post Release 0.4.3, 14 May 05 KM+HJB
Added manual sections on folding and unfolding.
- Release 0.4.3, 7 May 2005 KM+HJB
Changed MAP to one that has linelength restricted to the maximum allowed
and introduced the character 'n' to represent new line
Fixed some bugs including the output of ambiguous closing text delimiter
and scrambling of comments when format everything except comments is used
Added local definition of isblank for systems that do not have it in ctype
Revised Makefile to try /usr/local/... for ncurses
- Release 0.4.2, 29 April 2005 KM
Corrected failure to output opening and closing
delimiters when converting a non-delimited string
to a folded textfield, and fixed the handling
of unfolding according to a MAP, where it would output a
single space before converting the folded text field
back to a nondelimited string.
- Release 0.4.1, 28 April 2005 GT
Corrected "then then" and changed "formating" to
"formatting" in README..., ReadFile.cpp, getOpt.cpp.
- Release 0.4, 27 April 2005 KM, GT
-fixed concatenation of closing text delimiter with the next token
-fixed some segmentation fault problems occurring when the input file is incorrect
-in menus, updated the version of the program and fixed a bug that set
 the maxlength of a line in folding using -g after the user's input (overwriting the user input and making the option useless)
-in getOpts, made the program exit upon invalid input, -V or -h (basically every argument except the valid ones will print the help and exit)
- Release 0.3, 23 April 2005 KM
Changes in the command line options:
-x to be used instead of -h.
-h used to print a help message
-C instead of -r
-V used to print the current version and exit
-M instead of -g for creating/using a MAP
-g used to invoke the GUI
-L instead of -p for preserving leading blanks
-p used to print the error messages and warnings
-m to take values within the range 60-2048
The file chunks to be in form n-n,n-n instead of n-n-n-n
Corrections in code in FUCIF.c to preserve empty lines in a more consistent way.
Corrections in ReadFile.cpp to handle the appearance of "global_" within loops
Changes in menus.cpp allowing to use ciffold without any arguments by opening stdin and
stdout and passing stdin to stdout without altering it.
Fixed the MAP option to not be selected by default.
Disabled the warning "#_#WARNING: AMBIGUOUS STRING
DELIMETER CHANGED TO AN ALTERNATIVE ONE(\' to \" or \"to \')"
Made the error messages and warnings go to stderr.
- Release 0.2, 19 April 2005 HJB
Corrections to handling of command line input, enabling - as an indicator for
standard input or standard output to allow the use of ciffold as a filter.
- Release 0.1, 16 April 2005 KM, GT and HJB
Initial pre-release.
16. Known Bugs
- In some cases, mapped files are not reconstructed correctly. Use
of the -M option is not recommended at this time.
- The temporary file is not always cleaned up.
- The line length of the MAP file is not restricted and can exceed the maximum allowed line length.
- The CIFFOLD 0.3 release appears to fold and unfold correctly formatted
CIFs, but, in some cases, invalid CIFs cause segmentation faults instead
of providing validation messages. The known cases have been addressed in
CIFFOLD 0.4, but caution is advised.
- Some combinations of the options -M -x -e -c will format the file incorrectly,
although this does not necessarily result in an invalid CIF.
Written by K. Mitev, 15 April 2005,
revised, H. J. Bernstein, 16 April
2005, 19 April 2005,
K. Mitev, 22 April 2005,
H. J. Bernstein,
K. Mitev, G. Todorov, 27 April 2005,
G. Todorov 28 April 2005,
K. Mitev 29 April 2005,
K. Mitev 6 May 2005,
H. J. Bernstein,
7 May 2005,
K. Mitev, H. J. Bernstein 14 May 2005,
K. Mitev 31 May 2005,
K. Mitev 11 July 2005, H. J. Bernstein 22 July 2005,
H. J. Bernstein 23 July 2005,
G. Todorov, H. J. Bernstein 25 July 2005,
H. J. Bernstein 1 August 2005,
H. J. Bernstein 30 September 2005,
K. Mitev, H. J. Bernstein 1 February 2006