Package smile.data
- Qualitative variables:
- The data values are non-numeric categories. Examples: Blood type, Gender.
- Quantitative variables:
- The data values are counts or numerical measurements. A quantitative variable can be either discrete such as the number of students receiving an 'A' in a class, or continuous such as GPA, salary and so on.
- Nominal data:
- Data values are non-numeric group labels. For example, a Gender variable can be defined as male = 0 and female = 1.
- Ordinal data:
- Data values are categorical and may be ranked in some numerically meaningful way. For example, strongly disagree to strongly agree may be defined as 1 to 5.
- Continuous data:
- Interval data: data values range over a real interval, which can be as large as negative infinity to positive infinity. The difference between two values is meaningful, but the ratio of two interval values is not. Examples: temperature, IQ.
- Ratio data: both the difference and the ratio of two values are meaningful. Examples: salary, weight.
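The scales above can be illustrated with a minimal Java sketch. All names here (`MeasurementScales`, `BloodType`, `Likert`, `celsiusToFahrenheit`, `kilogramsToPounds`) are hypothetical and are not part of the smile.data API; the sketch only shows which comparisons are meaningful on each scale.

```java
/** Hypothetical illustration of the measurement scales; not part of smile.data. */
final class MeasurementScales {
    /** Nominal: labels without a meaningful order; any integer code is arbitrary. */
    enum BloodType { A, B, AB, O }

    /** Ordinal: the declaration order carries meaning. */
    enum Likert {
        STRONGLY_DISAGREE, DISAGREE, NEUTRAL, AGREE, STRONGLY_AGREE;

        /** Rank from 1 (strongly disagree) to 5 (strongly agree). */
        int rank() { return ordinal() + 1; }
    }

    /** Interval scale: the zero point is arbitrary, so ratios are not meaningful. */
    static double celsiusToFahrenheit(double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }

    /** Ratio scale: zero means "none", so ratios survive a change of unit. */
    static double kilogramsToPounds(double kg) {
        return kg * 2.20462; // approximate conversion factor
    }

    public static void main(String[] args) {
        // Nominal: only equality is meaningful.
        System.out.println(BloodType.A == BloodType.B);

        // Ordinal: ranks can be compared.
        System.out.println(Likert.AGREE.rank() > Likert.DISAGREE.rank());

        // Interval: 20 C / 10 C = 2, but the same temperatures in Fahrenheit give 68 / 50.
        double ratioCelsius = 20.0 / 10.0;
        double ratioFahrenheit = celsiusToFahrenheit(20.0) / celsiusToFahrenheit(10.0);
        System.out.println(ratioCelsius + " vs " + ratioFahrenheit);

        // Ratio: 60 kg is twice 30 kg in kilograms and in pounds.
        System.out.println(kilogramsToPounds(60.0) / kilogramsToPounds(30.0));
    }
}
```

The temperature ratio changes with the unit while the weight ratio does not, which is the practical difference between interval and ratio data.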
The cost of having many objects is that each object in a JVM must carry some associated metadata, for example, the java.lang.Class value that represents the type of the object, or the length of an array object. The most common approach is to place this metadata at the start of the object, creating an object header.
For a large or complex object, the size of the header is relatively insignificant. For a small object, however, the size of the header can become significant. For byte[1], 64 bits of metadata are often required for a single 8-bit value. Additionally, the JVM is likely to add at least 3 bytes of padding to ensure that the subsequent object in the heap starts on an aligned address. The total extra memory requirement for 8 bits of data is therefore 88 bits. Every object has a similar associated overhead, so the more objects you have, the greater the effect on system resources.
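The arithmetic for the byte[1] example can be written out as a short sketch. The class name `SmallObjectOverhead` and its constants are hypothetical; the figures are the ones quoted in the text, not measurements, and real header and padding sizes depend on the JVM and its configuration.

```java
/** Hypothetical sketch of the byte[1] overhead arithmetic from the text. */
final class SmallObjectOverhead {
    static final int HEADER_BITS = 64;      // metadata figure quoted in the text
    static final int PADDING_BITS = 3 * 8;  // at least 3 bytes of alignment padding
    static final int DATA_BITS = 8;         // the single byte stored in byte[1]

    /** Extra memory, in bits, beyond the payload itself. */
    static int overheadBits() {
        return HEADER_BITS + PADDING_BITS;
    }

    /** Overhead bits spent per bit of actual data. */
    static double overheadPerDataBit() {
        return (double) overheadBits() / DATA_BITS;
    }

    public static void main(String[] args) {
        System.out.println(overheadBits());        // 88
        System.out.println(overheadPerDataBit());  // 11.0
    }
}
```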
The structure of Java arrays can exaggerate this overhead. Consider an array of Complex objects. Each instance of the Complex class has two double values, of 64 bits each, plus the object header. Assuming that the header is just the class reference, and occupies only 32 bits, each Complex instance is 16 bytes of data and 4 bytes of extra overhead. An array of 10 Complex objects consists of the header (class + length = 8 bytes), plus 10 object references (assuming 4 bytes each = 40 bytes). If each element of the array contains a unique Complex object, the total is 160 bytes of data, but 88 bytes of additional overhead (8 bytes of array header, 40 bytes of references, and 40 bytes of object headers).
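The accounting for the array of Complex objects can be checked with a small sketch. The class `ComplexArrayOverhead` and its constants are hypothetical and encode only the simplifying assumptions stated in the text (4-byte object header, 8-byte array header, 4-byte references). The comparison with a single flat double[] is an added illustration under the same assumptions, not a claim about how any particular library stores its data.

```java
/** Hypothetical sketch of the overhead accounting for an array of Complex objects. */
final class ComplexArrayOverhead {
    static final int OBJECT_HEADER_BYTES = 4; // class reference only (assumption from the text)
    static final int ARRAY_HEADER_BYTES = 8;  // class reference + length
    static final int REFERENCE_BYTES = 4;     // one reference per array element
    static final int DOUBLE_BYTES = 8;

    /** Payload: two doubles per complex number. */
    static int dataBytes(int n) {
        return n * 2 * DOUBLE_BYTES;
    }

    /** Overhead of a Complex[n] whose elements are n distinct objects. */
    static int objectArrayOverheadBytes(int n) {
        return ARRAY_HEADER_BYTES + n * REFERENCE_BYTES + n * OBJECT_HEADER_BYTES;
    }

    /** Overhead of storing the same values in one double[2 * n]. */
    static int flatArrayOverheadBytes() {
        return ARRAY_HEADER_BYTES;
    }

    public static void main(String[] args) {
        int n = 10;
        System.out.println(dataBytes(n));                // 160
        System.out.println(objectArrayOverheadBytes(n)); // 88
        System.out.println(flatArrayOverheadBytes());    // 8
    }
}
```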
The data locality of a tree of objects also has a large impact on computational efficiency. Modern hardware relies heavily on caching and prefetching to provide efficient access. Caching exploits the observation that memory that was recently accessed is likely to be accessed again soon, so keeping the most recently accessed data in very fast memory usually results in the best performance. Data is cached in small blocks, which are known as cache lines, to exploit another observation: data that is stored in sequence is often accessed in sequence. Code that accesses array[i] often proceeds to access array[i+1].
When a data structure is composed of many different objects, an operation on the information might need to access several objects to locate the actual data. However, a tree of related objects cannot be guaranteed to be close enough in memory to appear in the same block of cached memory. Some JVM configurations attempt to keep related objects close to each other in memory, but this is not always possible. Even when the JVM can place objects next to each other, the space that is required by the object header lies between the objects, possibly reducing the benefit.
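The two layouts discussed above can be contrasted in a minimal sketch. The names `LocalityLayouts`, `Complex`, `sumNormSqObjects`, and `sumNormSqFlat` are hypothetical. The code does not measure performance; it only shows that the object-based loop follows one reference per element, while the primitive-array loop walks contiguous memory, and that both compute the same result.

```java
/** Hypothetical sketch contrasting an array of objects with parallel primitive arrays. */
final class LocalityLayouts {
    /** One small heap object per value, each with its own header. */
    static final class Complex {
        final double re;
        final double im;

        Complex(double re, double im) {
            this.re = re;
            this.im = im;
        }
    }

    /** Each iteration dereferences a[i]; the objects may be scattered across the heap. */
    static double sumNormSqObjects(Complex[] a) {
        double sum = 0.0;
        for (Complex c : a) {
            sum += c.re * c.re + c.im * c.im;
        }
        return sum;
    }

    /** Each array is one contiguous block, so re[i + 1] directly follows re[i]. */
    static double sumNormSqFlat(double[] re, double[] im) {
        double sum = 0.0;
        for (int i = 0; i < re.length; i++) {
            sum += re[i] * re[i] + im[i] * im[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 4;
        Complex[] objects = new Complex[n];
        double[] re = new double[n];
        double[] im = new double[n];
        for (int i = 0; i < n; i++) {
            re[i] = i;
            im[i] = i + 1;
            objects[i] = new Complex(re[i], im[i]);
        }
        // Both layouts hold the same values and produce the same sum.
        System.out.println(sumNormSqObjects(objects) == sumNormSqFlat(re, im));
    }
}
```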
Class and Description
- Abstract tuple base class.
- Binary sparse dataset.
- Categorical variable encoder.
- An immutable collection of data organized into named columns.
- Stream collectors.
- Dataset<D,T>: An immutable collection of data objects.
- A data frame with a new index instead of the default [0, n) row index.
- SampleInstance<D,T>: An immutable sample instance.
- List of Lists sparse matrix format.
- An in-process SQL database management interface.
- A tuple is an immutable finite ordered list (sequence) of elements.