2. Dataset

The dataset classes aim to emulate some of the functionality of NumPy’s ndarray class. In addition, datasets can have items which contain multiple elements - these are known as compound datasets.

2.1. Generic dataset

Capabilities
  • Indexing
  • Slicing (now returns views)
  • Boolean dataset selection
  • Some advanced integer indexing (see the sketch after the usage example below)
  • Arithmetic operations
  • Comparison and logic operations
  • Mathematical and statistical functions
Key differences
  • Incomplete implementation of NumPy’s methods
  • ndarray.get_elements method to get a view of the n-th elements of a compound dataset
Implemented NumPy methods (1.10)
  • Array attributes: shape, ndim, data, size, itemsize, nbytes, T, view, indices
  • Array methods: copy, fill, reshape, resize, transpose, flatten, squeeze, take, put, max, min, sum, prod, all, any, argmax, argmin, cumsum, cumprod
  • Array creation: array, zeros, zeros_like, ones, ones_like, empty, full, full_like, eye, identity, linspace, logspace, arange, diag, diagflat, meshgrid, indices
  • Array manipulation: append, tile, repeat, concatenate, vstack, hstack, dstack, array_split, split, vsplit, hsplit, dsplit, fliplr, flipud, roll, rot90, reshape, resize, ravel, rollaxis, atleast_1d, atleast_2d, atleast_3d
  • Array modification: fill, append
  • Array comparisons and logic operations: all, any, greater, greater_equal, less, less_equal, equal, not_equal, logical_not, logical_and, logical_or, logical_xor, allclose, nonzero, select, where, choose
  • Indexing operations: unravel_index, ravel_multi_index
  • Maths: add, subtract, multiply, divide, negative, power, real, imag, absolute, exp, log, log2, log10, expm1, log1p, sqrt, square, reciprocal, angle, conjugate, floor_divide, remainder, phase, signum, diff, gradient, roots, modf, fmod, interp
  • Trig: sin, cos, tan, arcsin, arccos, arctan, arctan2, hypot, sinh, cosh, tanh, arcsinh, arccosh, arctanh, deg2rad, rad2deg
  • Matrix: dot, vdot, inner, outer, matmul, tensordot, kron, trace
  • Linear algebra: eig, eigvals, cholesky, qr, svd, norm, cond, det, solve, inv, pinv, matrix_power (see the sketch after this list)
  • Rounding: rint, ceil, floor, trunc, fix
  • Bitwise: bitwise_and, bitwise_or, bitwise_xor, invert, left_shift, right_shift
  • Stats: amax, amin, ptp, mean, std, var, cumprod, cumsum, histogram
  • Random: rand, randint, random_integer, randn, exponential, poisson, seed
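
For example, the linear algebra functions mirror their NumPy counterparts. The following is a minimal sketch, assuming (as in NumPy) that these functions live in a linalg submodule and accept datasets in place of ndarrays:

import scisoftpy as dnp

a = dnp.array([[3., 1.], [1., 2.]])
b = dnp.array([9., 8.])
# solve the linear system a . x = b
x = dnp.linalg.solve(a, b)
# check the solution by matrix-vector multiplication
dnp.dot(a, x)
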
Non-NumPy
  • Array methods: cast, minpos, maxpos, index (Jython only)
  • Maths: dividez, cbrt
  • Stats: skewness, kurtosis, iqr, quantile
  • Import/export JAMA [1] (Jython only)
  • phase, residual, normalise, centroid, crossings
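
The following short sketch exercises a few of these non-NumPy extras; the exact signatures and semantics are assumptions based on the names listed above:

import scisoftpy as dnp

a = dnp.array([3.0, -1.5, 2.0, 7.25])
# positions of the minimum and maximum items
a.minpos()
a.maxpos()
# cast to another dataset type (assuming NumPy-style dtype constants)
b = a.cast(dnp.int32)
# dividez is assumed to be a division that returns zero where the divisor is zero
c = dnp.dividez(a, dnp.array([1.0, 0.0, 2.0, 0.0]))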

This is an example of dataset usage:

import scisoftpy as dnp

# create a 1D dataset from 0 to 9 of doubles
a = dnp.arange(10)
# make it have 2 rows and 5 columns
a.shape = 2,5
a
# create new dataset with each element raised to power of 5/2
b = dnp.power(a,2.5)
# create new dataset that is sum of two datasets
c = a + b
# modify dataset in-place by dividing each element by corresponding
# element in other dataset
c /= a

# reassign a to a new (integer) dataset of 2 rows and 3 columns
a = dnp.array([[0, 1, 2], [3, 4, 5]])
# create a view by slicing the dataset, taking just the 2nd column
d = a[:,1]
# modify dataset and set a slice to a given value
a[1,0:1] = -2

# b is a double dataset (highest type that can contain all entries is used)
b = dnp.array([[-2, 2.3], [1.2, 9.3]])
# modify dataset and set a slice to the values in another dataset
a[:,1:] = b
# notice the values from b are truncated to integers
a

# in Jython only
import Jama.Matrix as Matrix
#  make a dataset from a Jama Matrix
m = Matrix([[0,1,1.5],[2,3,3.5]])
dj = dnp.array(m) # directly
da = dnp.array(m.getArray()) # from Java array
db = dnp.array([m.getArray(), m.getArray()]) # from list of arrays

import scisoftpy.random as drd
# create dataset of shape 3,12 of uniform random numbers between 0 and 1
a = drd.rand(3,12)
# take item-wise sine
dnp.sin(a)
# create dataset of shape 3,4 of random integers from 0 to 11 inclusive
drd.randint(12, size=(3,4))

# comparison and logic
# create boolean datasets where items are true where condition applies
c = a >= 0.3
d = a < 0.2
# check if all or any of items in these are true
all(c)
any(c)
all(d)
any(d)
# how many were true
c.sum()
d.sum()
# get flattened 1D dataset of the items which were >= 0.3
a[c]
# assign value where items are < 0.4
a[a < 0.4] = 0
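
The capability list above mentions some advanced integer indexing, which the example does not cover. A minimal sketch, assuming NumPy-style integer (fancy) indexing is among the supported forms:

import scisoftpy as dnp

a = dnp.arange(10)
# pick out items 2, 5 and 7 with a dataset of integer indices
i = dnp.array([2, 5, 7])
a[i]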

2.2. RGB dataset

When a colour image is loaded (as described in the next chapter), an RGB dataset is created. This type of dataset has items which are tuples of three 16-bit integers, where each integer represents the value of one colour channel; the channels are ordered as red, green and blue. An RGB dataset has four extra attributes, red, green, blue and grey, which retrieve copies of the corresponding colour channel or, in the grey case, a weighted mixture of the channels.

There are also four extra methods:

get_red(self, dtype=None)
get_green(self, dtype=None)
get_blue(self, dtype=None)
get_grey(self, cweights=None, dtype=None)

where dtype is an optional dataset type (the default is int16) and cweights is an optional set of weights for combining the colour channels. The default weights are (0.299, 0.587, 0.114), which correspond to the NTSC formula for converting RGB to luma values.
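
As an illustrative sketch, assume a colour image file called image.png (a hypothetical name) and that the loading function covered in the next chapter returns the RGB dataset as its first item:

import scisoftpy as dnp

# load a colour image; 'image.png' is a hypothetical file name
im = dnp.io.load('image.png')[0]
# copies of the colour channels via the extra attributes
r = im.red
g = im.green
# grey-scale conversion with the default NTSC weights
grey = im.get_grey()
# or with equal weights and a floating-point result (dtype constant assumed)
grey2 = im.get_grey(cweights=(1/3., 1/3., 1/3.), dtype=dnp.float64)
# as a compound dataset, views of the n-th elements are also available
reds = im.get_elements(0)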

2.3. Lazy dataset

Sometimes a need arises to reference a multitude of datasets, or very large ones, for example when dealing with datasets held in tree-structured file formats. The memory required could easily exceed most computers’ capabilities, so the concept of a lazy dataset is introduced. The laziness refers to the deferred action of loading data from file into memory. This in turn means that lazy datasets have only a few properties:

  • shape: shape of the dataset
  • name: name of the dataset
  • dtype: dataset type

The only access is provided by indexing or slicing a lazy dataset:

v = lazy[0,1] # index an item
s = lazy[1,:] # slice
entire = lazy[...] # load in the entire dataset (do so with caution)
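
A cautious usage pattern, assuming lazy was obtained from one of the file loaders, is to inspect the metadata before pulling anything into memory:

# check the cost before loading anything
print(lazy.name, lazy.shape, lazy.dtype)
# then load only a manageable chunk, here the first 100 items of the first axis
chunk = lazy[:100]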