Input data formats


Version 2.32


Manual page for Input_data_formats(PL)

Ploticus can read tabular ASCII data from files, commands, or standard input. If you're using prefabs, your data source is specified on the command line, e.g. data=myfile.dat. If you're writing scripts, proc getdata is used to read data. Data can also be embedded directly into scripts.


Plotting from data fields

Plotting and data display operations are done using fields. Suppose we have a data set like this in the file myfile.dat:

   F1 2.43 0.47 PF7955
   F2 2.79 0.28 PT2705
   F3 2.62 0.37 PB2702
Suppose we want to draw a bar graph using the values in field 2, and draw error bars using the values in field 3. Fields can be specified by number, so we could use this command:
   pl -prefab vbars  data=myfile.dat  y=2   err=3
If your data set has a field name header (field names in the first row) you can reference fields using those names if you want to. For example:
   test level se    case_id
   F1   2.43  0.47  PF7955
   F2   2.79  0.28  PT2705
   F3   2.62  0.37  PB2702
With this data set we could use this command:
    pl -prefab vbars  data=myfile.dat  header=yes   y=level  err=se
The field name header must use the same delimitation as the data proper. Field names are like variable names; they cannot contain embedded white space, comma, or quote characters. Script writers can use field names by setting the fieldnameheader option to yes. Script writers can also assign field names explicitly if desired.
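
The lookup rule described above, where a field is referenced either by its number (counting from 1) or by a name from the header row, can be sketched in Python. This is only an illustration of the rule, not ploticus's own code; the function name resolve_field is hypothetical.

```python
def resolve_field(ref, header=None):
    """Return the 0-based column index for a field reference.

    ref is either a 1-based field number ("2") or a field name ("level").
    header is the list of field names from the first row, if there is one.
    """
    if ref.isdigit():
        return int(ref) - 1          # field numbers start at 1
    if header is None:
        raise ValueError(f"field name {ref!r} used but no header was given")
    return header.index(ref)         # raises ValueError if the name is unknown


header = "test level se    case_id".split()
row = "F1   2.43  0.47  PF7955".split()

y = row[resolve_field("level", header)]   # same column as y=2
err = row[resolve_field("3")]             # same column as err=se
```

Here y=level and y=2 select the same column, as do err=se and err=3.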


Recognized data formats

Data files or streams should be plain ASCII text, not binary, and should be organized as a collection of rows having one or more fields. Fields may have numeric or alphanumeric content and may be delimited in one of these ways:


whitespace delimited
	F1 2.43 0.47 Jane_Doe     PF7955   
	F2 2.79 0.28 John_Smith   PT2705
	F3 2.62 0.37 Ken_Brown    PB2702
	F4  -    -   Bud_Flippner PX7205
	...
Fields are delimited by any mixture of one or more spaces or tabs. No quote processing is done. Blank fields must be represented using a nonblank code, and alphanumeric fields cannot contain white space. Embedded spaces must be represented some other way, such as with underscores.


spacequote delimited
	F1 2.43 0.47 "Jane Doe"   PF7955
	F2 2.79 0.28 "John Smith" PT2705
	F3 2.62 0.37 "Ken Brown"  PB2702
	F4 "" "" "Bud Flippner"   PX7205
This is a variant of whitespace delimitation where fields may be enclosed in double quotes ("), and quoted fields may have embedded white space. Blank fields may be represented as shown or using a code.
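
The difference between whitespace and spacequote delimitation can be illustrated with a small Python sketch. It approximates the rules as described here and is not ploticus's actual parser; the function names are hypothetical, and behavior on unbalanced quotes is an assumption.

```python
import re


def split_whitespace(line):
    """Whitespace delimitation: any run of spaces or tabs separates fields."""
    return line.split()


# A quoted field is "..." (possibly empty); anything else is a run of non-blanks.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def split_spacequote(line):
    """Spacequote delimitation: like whitespace, but "..." may hold spaces."""
    return [quoted if bare == "" else bare
            for quoted, bare in _TOKEN.findall(line)]


plain = split_whitespace("F4  -    -   Bud_Flippner PX7205")
quoted = split_spacequote('F4 "" "" "Bud Flippner"   PX7205')
```

In the first case the blank fields have to be carried by a code such as -, while in the second the empty quotes yield zero-length fields and the name keeps its embedded space.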


tab delimited
	F1	2.43	0.47	Jane Doe
	F2	2.79	0.28	John Smith
	F3	2.62	0.37	Ken Brown
	F4			Bud Flippner
	...
Fields are separated by a single tab. Zero length fields are taken to be blank. Data fields cannot have embedded tabs. The first field must start at the very beginning of the line. The last field in a row may be terminated by a tab or not.
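
As a rough Python sketch of the tab rules above (the function name split_tab is hypothetical, and treating exactly one trailing tab as a terminator rather than as an extra blank field is an assumption):

```python
def split_tab(line):
    """Split one tab-delimited row; zero-length fields come back as ''."""
    line = line.rstrip("\r\n")
    if line.endswith("\t"):
        line = line[:-1]      # assumed: one trailing tab only ends the last field
    return line.split("\t")


fields = split_tab("F4\t\t\tBud Flippner\n")
```

Because every single tab is a separator, the two adjacent tabs after F4 produce two blank fields, and the embedded space in the name is preserved.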


comma delimited
	"F1",2.43,0.47,"Jane Doe"
	"F2",2.79,0.28,"John Smith"
	"F3",2.62,0.37,"Ken Brown"
	"F4",,,"Hello""world"
	...
Also known as comma-quote delimited or CSV. Fields are separated by commas. Alphanumeric fields are enclosed in double quotes (although ploticus really doesn't care about this unless a field contains embedded whitespace or comma characters). Zero length fields and fields containing "" are taken to be blank. An embedded double quote is represented using ("") as seen in row F4 above. No whitespace is allowed before or after fields (although this apparently is tolerated in the CSV spec).
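
For comparison, Python's standard csv module follows the same quoting conventions, including the doubled quote for an embedded double quote. The sketch below only shows how the sample rows break into fields; it is a different parser from the one in ploticus and may differ in details such as whitespace handling.

```python
import csv
import io

data = '"F1",2.43,0.47,"Jane Doe"\n"F4",,,"Hello""world"\n'
rows = list(csv.reader(io.StringIO(data)))

for row in rows:
    print(row)
```

The zero-length fields in row F4 come back as empty strings, and the doubled quote becomes a single " inside the last field.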


Notes regarding data input and parsing

As of version 2.30, numeric values in scientific notation should be handled transparently.

Empty rows and commented rows are ignored. The default comment symbol is // and it should appear before any other content on a line. An alternate comment symbol can be specified if desired.
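
A minimal Python sketch of this filtering step follows. The function name usable_rows is hypothetical, and ignoring leading white space before the comment symbol is an assumption that suits whitespace delimitation only; the other delimitation types are stricter about leading white space.

```python
def usable_rows(lines, comment="//"):
    """Yield rows that are neither empty nor commented out."""
    for line in lines:
        text = line.strip()
        if text == "" or text.startswith(comment):
            continue                      # skip empty and commented rows
        yield line.rstrip("\r\n")


sample = ["// site A readings\n", "\n", "F1 2.43 0.47\n", "F2 2.79 0.28\n"]
kept = list(usable_rows(sample))
```

Passing a different value for comment models the alternate comment symbol.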

Data sets with a variable number of fields may be accommodated by specifying nfields. Otherwise, the first usable row dictates the expected number of fields per record. If a row has more than the expected number of fields, the extra fields are silently ignored. If a row has fewer than the expected number of fields, blank fields are silently added until the record has the same number of fields as the other records. nfields may also be used to read only the first few fields on every row and ignore the rest.
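
The padding and truncation rule can be sketched in Python as follows. This only models the behavior described above; the function name normalize is hypothetical, and the nfields argument stands in for the ploticus setting of the same name.

```python
def normalize(rows, nfields=None):
    """Pad or truncate each row to a fixed number of fields."""
    rows = list(rows)
    if nfields is None and rows:
        nfields = len(rows[0])                    # first usable row sets the count
    out = []
    for row in rows:
        row = row[:nfields]                       # extra fields are dropped
        row = row + [""] * (nfields - len(row))   # short rows get blank fields
        out.append(row)
    return out


rows = [["a", "1", "2"], ["b", "3"], ["c", "4", "5", "6"]]
by_first_row = normalize(rows)
first_two = normalize(rows, nfields=2)
```

With no nfields, the three-field first row sets the width; with nfields=2, only the first two fields of every row are kept.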

Rows may be conditionally selected at the time of reading by specifying a select condition. Rows not meeting the condition will be skipped.

Leading white space is allowed when using whitespace or spacequote delimitation. It is not allowed with the other types.

Comma-delimited data files may include commented lines and empty lines, but the comment symbol must be at the beginning of the line, and empty lines may not contain any whitespace.

Each row, including the last one, should be terminated with a newline or CR/LF.

Data that is specified within a ploticus script is subject to script processing: leading white space is stripped off and the script interpreter will attempt to evaluate constructs that look like operators or variables.


Missing data

Missing data values may be represented using a code or by a zero-length field, depending on the delimitation method. A value is considered missing if it is non-plottable. That is, if plotting numerics, any non-numeric value is considered missing data; if plotting dates, any value that isn't a date (in the current format) is considered missing data. When plotting, missing values are generally skipped over, but exactly what occurs depends on what kind of plot operation is being done.
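
The "non-plottable means missing" rule can be illustrated with a short Python sketch. The function names are hypothetical, the date format %m/%d/%y is only a stand-in for whatever the current date format is, and Python's float() accepts a few spellings (such as nan and inf) that ploticus may treat differently.

```python
from datetime import datetime


def is_plottable_number(value):
    """True if value parses as a number; otherwise it counts as missing."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_plottable_date(value, fmt="%m/%d/%y"):
    """True if value is a date in the assumed current format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


column = ["2.43", "-", "", "1.5e3"]
plotted = [float(v) for v in column if is_plottable_number(v)]
```

Both the code - and the zero-length field are skipped, while the value in scientific notation is kept.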


Embedded #set statements

Data files may contain embedded #set statements for setting prefab parameters and ploticus variables directly from the data file. The syntax is:
   #set VARIABLE = value
or
   #set parametername = value

All tokens are separated by whitespace and quoting is never used. Here's an example of a data file with embedded #set statements:

  #set mytitle = Orders processed on Tue 8 Jul '03
  #set ymax = 40
  ABC	3	4	11	42.3
  DEF	5	2	48	27.4
  GHI	9	1	79	37.3
  ...

As noted in the docs, some prefab parameters, notably those controlling data input, cannot be set this way.
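
To make the syntax concrete, here is a small Python sketch that separates embedded #set statements from data rows. It is not how ploticus itself processes them; the function name is hypothetical, and tolerating leading white space before #set is an assumption.

```python
def split_set_statements(lines):
    """Return (settings, data_rows) for lines that may contain #set statements."""
    settings, data = {}, []
    for line in lines:
        tokens = line.split(None, 3)     # '#set', name, '=', rest of line
        if len(tokens) == 4 and tokens[0] == "#set" and tokens[2] == "=":
            settings[tokens[1]] = tokens[3].rstrip()   # value is taken unquoted
        else:
            data.append(line.rstrip("\r\n"))
    return settings, data


lines = [
    "#set mytitle = Orders processed on Tue 8 Jul '03\n",
    "#set ymax = 40\n",
    "ABC\t3\t4\t11\t42.3\n",
]
settings, data = split_set_statements(lines)
```

Because quoting is never used, everything after the = token, including the apostrophe in '03, is kept as the value.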


Other possibilities

Since ploticus can read data on standard input, there are many possibilities for getting data for plotting. To get data out of an SQL database, use your database's command line tool to extract tabular ASCII data. Or, to get data across the internet using a URL, use a utility like Jeff Poskanzer's http_get. Be sure to set delim appropriately. These examples illustrate:

mysql acars < mycommand.sql | pl -prefab ... data=stdin delim=tab..
http_get "http://abc.net/delta/jan28.dat" | pl -prefab ... data=stdin ..
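
Any program that writes tabular ASCII to standard output can feed ploticus in the same way. As a hedged sketch, a Python script might format its rows as tab-delimited text like this; the function name and the sample rows are made up for illustration.

```python
import sys


def to_tab_delimited(rows):
    """Format rows as tab-delimited text, one newline-terminated row each."""
    lines = []
    for row in rows:
        # None becomes a zero-length (blank) field; fields cannot contain tabs.
        cells = ["" if v is None else str(v).replace("\t", " ") for v in row]
        lines.append("\t".join(cells) + "\n")
    return "".join(lines)


if __name__ == "__main__":
    rows = [("ABC", 3, 42.3), ("DEF", None, 27.4)]   # made-up sample rows
    sys.stdout.write(to_tab_delimited(rows))
```

The output of such a script would then be piped to pl with data=stdin and delim=tab, as in the mysql example above.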

If you are developing ploticus scripts, and your data exists in a state such that additional processing is required in order to work with it, you may be able to accomplish the desired manipulation within ploticus. To select certain fields, reformat fields, concatenate fields, etc., try using a proc getdata filter. To perform accumulation, tabulation and counting, rewriting as percents, computation of totals, reversing record order, rotation of row/column matrix, break processing, etc., proc processdata may be useful (it operates on the data after they have been read in).

Script writers wishing to embed large amounts of data directly into a script may be interested in proc trailer, which allows the data to be given at the end of the script file, to get it out of the way.


The current data set

Within a script, proc getdata can be invoked any number of times to read in data. However there can only be one active data set at any one time. This is referred to as the "current data set".

Note that proc getdata isn't the only way that the current data set can be filled. Proc processdata and proc tabulate perform computations using the current data set and then produce a new data set as a result. That result then becomes the current data set.

It is possible for the original data set to remain in memory when proc processdata produces new results. In fact, if proc processdata is invoked several times, there can be several data sets "stacked" in memory (this is an effective technique when plotting a series of derivations). By default the most recently created data set is the "current" one. Proc usedata can be used to select an earlier data set to be the current one. If a new data set is read by proc getdata, this entire stack structure is cleared. Data sets cannot be stacked by proc getdata.
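
The stacking behavior can be pictured with a toy Python model. This is a conceptual sketch of the case where earlier data sets stay in memory, not ploticus internals: the class, its method signatures, and the idea of selecting an earlier data set by index are all assumptions made for illustration.

```python
class DataSets:
    """Toy model of the current data set and the stack behind it."""

    def __init__(self):
        self.stack = []
        self.current = None       # index into self.stack

    def getdata(self, rows):
        self.stack = [rows]       # reading new data clears the whole stack
        self.current = 0

    def processdata(self, func):
        result = func(self.stack[self.current])
        self.stack.append(result)             # the earlier set stays in memory
        self.current = len(self.stack) - 1    # newest set becomes current

    def usedata(self, index):
        self.current = index      # make an earlier data set current again

    def rows(self):
        return self.stack[self.current]


ds = DataSets()
ds.getdata([1, 2, 3])
ds.processdata(lambda rows: [x * 2 for x in rows])
derived = ds.rows()
ds.usedata(0)
original = ds.rows()
```

A later call to getdata in this model discards both data sets and starts a fresh stack of one.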


Examples

Here are some script examples:
scat7.dat (white-space delimited)
stock.csv (comma delimited)
timeline3 (data specified within script)
km2 (data specified within script).


data display engine  
Copyright Steve Grubb

