Package smile.data

Record Class DataFrame

java.lang.Object
java.lang.Record
smile.data.DataFrame
Record Components:
schema - the schema of DataFrame.
columns - the columns of DataFrame.
index - the optional row index.
All Implemented Interfaces:
Serializable, Iterable<Row>

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index) extends Record implements Iterable<Row>, Serializable
Two-dimensional, potentially heterogeneous tabular data.
  • Constructor Details

    • DataFrame

      public DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
      Creates an instance of a DataFrame record class.
      Parameters:
      schema - the value for the schema record component
      columns - the value for the columns record component
      index - the value for the index record component
    • DataFrame

      public DataFrame(ValueVector... columns)
      Creates a DataFrame from the given columns.
      Parameters:
      columns - the columns of DataFrame.
    • DataFrame

      public DataFrame(RowIndex index, ValueVector... columns)
      Creates a DataFrame from the given columns with the given row index.
      Parameters:
      index - the row index.
      columns - the columns of DataFrame.
  • Method Details

    • toString

      public String toString()
      Returns a string representation of this record class. The representation contains the name of the class, followed by the name and value of each of the record components.
      Specified by:
      toString in class Record
      Returns:
      a string representation of this object
    • names

      public String[] names()
      Returns the column names.
      Returns:
      the column names.
    • dtypes

      public DataType[] dtypes()
      Returns the column data types.
      Returns:
      the column data types.
    • measures

      public Measure[] measures()
      Returns the columns' levels of measurement.
      Returns:
      the columns' levels of measurement.
    • shape

      public int shape(int dim)
      Returns the size of the given dimension. Provided for the convenience of pandas users.
      Parameters:
      dim - the dimension index.
      Returns:
      the size of given dimension.
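      The meaning of the dimension index is not spelled out above. A minimal plain-Java sketch, assuming the pandas convention that dimension 0 counts rows and dimension 1 counts columns. ShapeSketch and its column-major double[][] table are hypothetical stand-ins, not part of smile.data:

```java
/** Hypothetical stand-in for a table stored column by column. */
final class ShapeSketch {
    /**
     * Size along one dimension, assuming the pandas convention:
     * dim 0 is the number of rows, dim 1 is the number of columns.
     */
    static int shape(double[][] columns, int dim) {
        switch (dim) {
            case 0:
                // Every column has one entry per row.
                return columns.length == 0 ? 0 : columns[0].length;
            case 1:
                return columns.length;
            default:
                throw new IllegalArgumentException("Invalid dim: " + dim);
        }
    }
}
```

      Under that assumption, shape(0) would agree with nrow() and shape(1) with ncol().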
    • size

      public int size()
      Returns the number of rows. This is an alias of nrow that follows Java naming conventions.
      Returns:
      the number of rows.
    • nrow

      public int nrow()
      Returns the number of rows.
      Returns:
      the number of rows.
    • ncol

      public int ncol()
      Returns the number of columns.
      Returns:
      the number of columns.
    • isEmpty

      public boolean isEmpty()
      Returns true if the data frame is empty.
      Returns:
      true if the data frame is empty.
    • setIndex

      public DataFrame setIndex(String column)
      Sets the DataFrame index using an existing column. The index column is removed from the DataFrame.
      Parameters:
      column - the name of the column that will be used as the row index.
      Returns:
      a new DataFrame with the row index.
    • setIndex

      public DataFrame setIndex(Object[] index)
      Sets the DataFrame index.
      Parameters:
      index - the row index values.
      Returns:
      a new DataFrame with the row index.
    • column

      public ValueVector column(int j)
      Returns the j-th column.
      Parameters:
      j - the column index.
      Returns:
      the column vector.
    • column

      public ValueVector column(String name)
      Returns the column with the given name.
      Parameters:
      name - the column name.
      Returns:
      the column vector.
    • apply

      public ValueVector apply(String name)
      Returns the column with the given name. This is an alias of column for Scala's convenience.
      Parameters:
      name - the column name.
      Returns:
      the column vector.
    • apply

      public DataFrame apply(String... names)
      Returns a new DataFrame with selected columns. This is an alias to select for Scala's convenience.
      Parameters:
      names - the column names.
      Returns:
      a new DataFrame with selected columns.
    • get

      public Tuple get(int i)
      Returns the row at the specified index.
      Parameters:
      i - the row index.
      Returns:
      the i-th row.
    • apply

      public Tuple apply(int i)
      Returns the row at the specified index. This is an alias to get for Scala's convenience.
      Parameters:
      i - the row index.
      Returns:
      the i-th row.
    • loc

      public Tuple loc(Object row)
      Returns the row with the specified index.
      Parameters:
      row - the row index.
      Returns:
      the row with the specified index.
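      get(int) addresses a row by its position, while loc addresses it by its index value. A plain-Java sketch of that label-to-position lookup follows. RowIndexSketch is a hypothetical stand-in, not the smile.data RowIndex class, and it assumes unique labels, which the text above does not state:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical label-to-position index; not the smile.data RowIndex class. */
final class RowIndexSketch {
    private final Map<Object, Integer> positions = new HashMap<>();

    RowIndexSketch(Object[] labels) {
        for (int i = 0; i < labels.length; i++) {
            // Assumption: labels are unique, so each maps to exactly one row.
            if (positions.put(labels[i], i) != null) {
                throw new IllegalArgumentException("Duplicate label: " + labels[i]);
            }
        }
    }

    /** Translates a label into the row position a positional getter would take. */
    int positionOf(Object label) {
        Integer position = positions.get(label);
        if (position == null) {
            throw new NoSuchElementException("No row labeled " + label);
        }
        return position;
    }
}
```

      With labels {"a", "b", "c"}, looking up "c" resolves to position 2, which is the argument a positional call such as get(2) would take.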
    • loc

      public DataFrame loc(Object... rows)
      Returns a new data frame with specified rows.
      Parameters:
      rows - the row indices.
      Returns:
      a new data frame with specified rows.
    • get

      public DataFrame get(Index index)
      Returns a new data frame with row indexing.
      Parameters:
      index - the row indexing.
      Returns:
      the data frame of selected rows.
    • apply

      public DataFrame apply(Index index)
      Returns a new data frame with row indexing. This is an alias to get for Scala's convenience.
      Parameters:
      index - the row indexing.
      Returns:
      the data frame of selected rows.
    • get

      public DataFrame get(boolean[] index)
      Returns a new data frame with boolean indexing.
      Parameters:
      index - the boolean indexing.
      Returns:
      the data frame of selected rows.
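      Boolean indexing can be pictured as turning a mask into a list of row positions. A minimal plain-Java sketch, assuming the mask holds one entry per row and that true means "keep". BooleanIndexSketch is a hypothetical helper, not part of smile.data:

```java
/** Hypothetical helper illustrating boolean (mask) row selection. */
final class BooleanIndexSketch {
    /** Returns the positions whose mask entry is true, in ascending order. */
    static int[] selectedRows(boolean[] mask) {
        int count = 0;
        for (boolean keep : mask) {
            if (keep) count++;
        }
        int[] rows = new int[count];
        int k = 0;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) rows[k++] = i;
        }
        return rows;
    }
}
```

      A mask of {true, false, true} selects rows 0 and 2.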
    • apply

      public DataFrame apply(boolean[] index)
      Returns a new data frame with boolean indexing. This is an alias to get for Scala's convenience.
      Parameters:
      index - the boolean indexing.
      Returns:
      the data frame of selected rows.
    • isNullAt

      public boolean isNullAt(int i, int j)
      Checks whether the value at position (i, j) is null or a missing value.
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      true if the cell value is null.
    • get

      public Object get(int i, int j)
      Returns the cell at (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the cell value.
    • apply

      public Object apply(int i, int j)
      Returns the cell at (i, j). This is an alias to get for Scala's convenience.
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the cell value.
    • getInt

      public int getInt(int i, int j)
      Returns the int value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the int value of cell.
    • getLong

      public long getLong(int i, int j)
      Returns the long value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the long value of cell.
    • getFloat

      public float getFloat(int i, int j)
      Returns the float value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the float value of cell.
    • getDouble

      public double getDouble(int i, int j)
      Returns the double value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the double value of cell.
    • getString

      public String getString(int i, int j)
      Returns the string representation of the value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the string representation of cell value.
    • getScale

      public String getScale(int i, int j)
      Returns the level of the nominal or ordinal (NominalScale or OrdinalScale) value at position (i, j), as a string.
      Parameters:
      i - the row index.
      j - the column index.
      Returns:
      the cell scale.
      Throws:
      ClassCastException - when the data is not nominal or ordinal.
    • set

      public void set(int i, int j, Object value)
      Sets the value at position (i, j).
      Parameters:
      i - the row index.
      j - the column index.
      value - the new value.
    • update

      public void update(int i, int j, Object value)
      Updates the value at position (i, j). This is an alias to set for Scala's convenience.
      Parameters:
      i - the row index.
      j - the column index.
      value - the new value.
    • stream

      public Stream<Row> stream()
      Returns a (possibly parallel) Stream of rows.
      Returns:
      a (possibly parallel) Stream of rows.
    • iterator

      @Nonnull public Iterator<Row> iterator()
      Returns an iterator over the rows of this data frame.
      Specified by:
      iterator in interface Iterable<Row>
    • toList

      public List<Row> toList()
      Returns the List of rows.
      Returns:
      the List of rows.
    • dropna

      public DataFrame dropna()
      Returns a new data frame without rows that have null/missing values.
      Returns:
      the data frame without null/missing values.
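      A plain-Java sketch of row-wise filtering in the spirit of dropna. It assumes "missing" means either null or a Double NaN, which is an interpretation rather than something stated above. DropNaSketch and its Object[] rows are hypothetical stand-ins:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row filter; rows are plain Object[] arrays. */
final class DropNaSketch {
    /** Assumption: a cell is missing if it is null or a Double NaN. */
    static boolean hasMissing(Object[] row) {
        for (Object cell : row) {
            if (cell == null) return true;
            if (cell instanceof Double && ((Double) cell).isNaN()) return true;
        }
        return false;
    }

    /** Returns a new list holding only the rows without missing cells. */
    static List<Object[]> dropna(List<Object[]> rows) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            if (!hasMissing(row)) kept.add(row);
        }
        return kept;
    }
}
```

      The input list is left untouched, mirroring the "returns a new data frame" wording above.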
    • fillna

      public DataFrame fillna(double value)
      Fills null/NaN/Inf values of numeric columns with the specified value.
      Parameters:
      value - the value to replace NAs.
      Returns:
      this data frame.
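      The replacement rule for a single numeric column can be sketched in plain Java as follows. FillNaSketch is a hypothetical helper working on a primitive double[] column, where null cannot occur, so only NaN and infinite values are handled. Like fillna, it modifies its input and returns it:

```java
/** Hypothetical helper illustrating the NaN/Inf replacement rule. */
final class FillNaSketch {
    /** Replaces NaN and infinite entries in place and returns the same array. */
    static double[] fillna(double[] column, double value) {
        for (int i = 0; i < column.length; i++) {
            if (Double.isNaN(column[i]) || Double.isInfinite(column[i])) {
                column[i] = value;
            }
        }
        return column;
    }
}
```

      Because the array is modified in place, callers that need the original values should copy the column first.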
    • select

      public DataFrame select(int... indices)
      Returns a new DataFrame with selected columns.
      Parameters:
      indices - the column indices.
      Returns:
      a new DataFrame with selected columns.
    • select

      public DataFrame select(String... names)
      Returns a new DataFrame with selected columns.
      Parameters:
      names - the column names.
      Returns:
      a new DataFrame with selected columns.
    • drop

      public DataFrame drop(int... indices)
      Returns a new DataFrame without selected columns.
      Parameters:
      indices - the column indices.
      Returns:
      a new DataFrame without selected columns.
    • drop

      public DataFrame drop(String... names)
      Returns a new DataFrame without selected columns.
      Parameters:
      names - the column names.
      Returns:
      a new DataFrame without selected columns.
    • add

      public DataFrame add(ValueVector... vectors)
      Adds columns to this data frame.
      Parameters:
      vectors - the columns to add.
      Returns:
      this dataframe.
    • set

      public DataFrame set(String name, ValueVector column)
      Sets the column values. If the column does not exist, adds it as the last column of the dataframe.
      Parameters:
      name - the column name.
      column - the new column value.
      Returns:
      this dataframe.
    • update

      public DataFrame update(String name, ValueVector column)
      Sets the column values. If the column does not exist, adds it as the last column of the dataframe. This is an alias to set for Scala's convenience.
      Parameters:
      name - the column name.
      column - the new column value.
      Returns:
      this dataframe.
    • join

      public DataFrame join(DataFrame other)
      Joins two data frames on their index. If either dataframe has no index, merges them horizontally by columns.
      Parameters:
      other - the data frame to join with.
      Returns:
      a new data frame with combined columns.
    • merge

      public DataFrame merge(DataFrame... dataframes)
      Merges data frames horizontally by columns. If several columns share a name, the later ones are renamed with a suffix such as _2, _3, etc.
      Parameters:
      dataframes - the data frames to merge.
      Returns:
      a new data frame with combined columns.
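      The renaming of duplicate column names can be sketched in plain Java. This is one plausible reading of the "_2, _3" rule, not the library's implementation, and MergeNamesSketch is a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating suffix-based renaming of duplicate names. */
final class MergeNamesSketch {
    /** Keeps the first occurrence of a name and suffixes later ones with _2, _3, ... */
    static List<String> uniqueNames(List<String> names) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String name : names) {
            // Number of times this name has appeared so far, including now.
            int count = seen.merge(name, 1, Integer::sum);
            unique.add(count == 1 ? name : name + "_" + count);
        }
        return unique;
    }
}
```

      The sketch does not guard against a generated name such as id_2 colliding with a column that is already called id_2.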
    • concat

      public DataFrame concat(DataFrame... dataframes)
      Concatenates data frames vertically by rows.
      Parameters:
      dataframes - the data frames to concatenate.
      Returns:
      a new data frame that combines all the rows.
    • factorize

      public DataFrame factorize(String... names)
      Returns a new DataFrame with given columns converted to nominal.
      Parameters:
      names - column names. If empty, all object columns in the data frame will be converted.
      Returns:
      a new DataFrame.
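      Converting a string column to nominal amounts to replacing each distinct string with an integer code. A plain-Java sketch, assuming levels are the sorted distinct values and that the column contains no nulls. Both are assumptions, and FactorizeSketch is a hypothetical helper:

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical helper illustrating string-to-code conversion. */
final class FactorizeSketch {
    /** Distinct values in natural (sorted) order; assumes no null entries. */
    static String[] levels(String[] values) {
        return new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
    }

    /** Maps each value to the position of its level in the sorted level array. */
    static int[] codes(String[] values, String[] levels) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            codes[i] = Arrays.binarySearch(levels, values[i]);
        }
        return codes;
    }
}
```

      For the values {"red", "blue", "red"} the levels are {"blue", "red"} and the codes are {1, 0, 1}.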
    • toArray

      public double[][] toArray(String... columns)
      Returns an array obtained by converting the columns in a data frame to numeric mode and then binding them together as the columns of a matrix. Missing values/nulls are encoded as Double.NaN. This overload adds no bias term and uses level encoding for categorical variables.
      Parameters:
      columns - the columns to export. If empty, all columns will be used.
      Returns:
      the numeric array.
    • toArray

      public double[][] toArray(boolean bias, CategoricalEncoder encoder, String... names)
      Returns an array obtained by converting the columns in a data frame to numeric mode and then binding them together as the columns of a matrix. Missing values/nulls are encoded as Double.NaN.
      Parameters:
      bias - if true, adds a first column of all 1's.
      encoder - the categorical variable encoder.
      names - the columns to export. If empty, all columns will be used.
      Returns:
      the numeric array.
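      The layout of the exported array can be sketched in plain Java for already-numeric columns: each row of the result holds an optional leading 1 followed by one value per column, with null mapped to Double.NaN. ToArraySketch is a hypothetical helper and leaves out categorical encoding entirely:

```java
/** Hypothetical helper illustrating the bias column and NaN encoding. */
final class ToArraySketch {
    /**
     * Converts column-major nullable values into a row-major double array.
     * If bias is true, every row starts with the constant 1.0.
     */
    static double[][] toArray(boolean bias, Double[][] columns) {
        int ncol = columns.length;
        int nrow = ncol == 0 ? 0 : columns[0].length;
        int offset = bias ? 1 : 0;
        double[][] out = new double[nrow][ncol + offset];
        for (int i = 0; i < nrow; i++) {
            if (bias) out[i][0] = 1.0;
            for (int j = 0; j < ncol; j++) {
                Double cell = columns[j][i];
                // Missing cells become NaN instead of failing on unboxing.
                out[i][j + offset] = cell == null ? Double.NaN : cell;
            }
        }
        return out;
    }
}
```

      With bias enabled, two columns produce rows of length three, which is the shape a linear model with an intercept typically expects.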
    • toMatrix

      public Matrix toMatrix()
      Returns a matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Missing values/nulls are encoded as Double.NaN. This overload adds no bias term and uses level encoding for categorical variables.
      Returns:
      the numeric matrix.
    • toMatrix

      public Matrix toMatrix(boolean bias, CategoricalEncoder encoder, String rowNames)
      Returns a matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Missing values/nulls are encoded as Double.NaN.
      Parameters:
      bias - if true, add the first column of all 1's.
      encoder - the categorical variable encoder.
      rowNames - the column to be used as row names.
      Returns:
      the numeric matrix.
    • describe

      public DataFrame describe()
      Returns the data structure and statistics.
      Returns:
      the data structure and statistics.
    • head

      public String head(int numRows)
      Returns the string representation of top rows.
      Parameters:
      numRows - the number of rows to show.
      Returns:
      the string representation of top rows.
    • tail

      public String tail(int numRows)
      Returns the string representation of bottom rows.
      Parameters:
      numRows - the number of rows to show.
      Returns:
      the string representation of bottom rows.
    • toString

      public String toString(int from, int to, boolean truncate)
      Returns the string representation of rows in specified range.
      Parameters:
      from - the initial index of the range to show, inclusive.
      to - the final index of the range to show, exclusive.
      truncate - whether to truncate long strings and right-align cells.
      Returns:
      the string representation of rows in specified range.
    • of

      public static DataFrame of(double[][] data, String... names)
      Creates a DataFrame from a 2-dimensional array.
      Parameters:
      data - the data array.
      names - the names of the columns.
      Returns:
      the data frame.
    • of

      public static DataFrame of(float[][] data, String... names)
      Creates a DataFrame from a 2-dimensional array.
      Parameters:
      data - the data array.
      names - the names of the columns.
      Returns:
      the data frame.
    • of

      public static DataFrame of(int[][] data, String... names)
      Creates a DataFrame from a 2-dimensional array.
      Parameters:
      data - the data array.
      names - the names of the columns.
      Returns:
      the data frame.
    • of

      public static <T> DataFrame of(Class<T> clazz, List<T> data)
      Creates a DataFrame from a collection of objects.
      Type Parameters:
      T - The data type of elements.
      Parameters:
      clazz - The class type of elements.
      data - The data collection.
      Returns:
      the data frame.
    • of

      public static DataFrame of(StructType schema, Stream<? extends Tuple> data)
      Creates a DataFrame from a stream of tuples.
      Parameters:
      schema - The schema of tuple.
      data - The data stream.
      Returns:
      the data frame.
    • of

      public static DataFrame of(StructType schema, List<? extends Tuple> data)
      Creates a DataFrame from a list of tuples.
      Parameters:
      schema - The schema of tuple.
      data - The data collection.
      Returns:
      the data frame.
    • of

      public static DataFrame of(ResultSet rs) throws SQLException
      Creates a DataFrame from a JDBC ResultSet.
      Parameters:
      rs - The JDBC result set.
      Returns:
      the data frame.
      Throws:
      SQLException - when JDBC operation fails.
    • hashCode

      public final int hashCode()
      Returns a hash code value for this object. The value is derived from the hash code of each of the record components.
      Specified by:
      hashCode in class Record
      Returns:
      a hash code value for this object
    • equals

      public final boolean equals(Object o)
      Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
      Specified by:
      equals in class Record
      Parameters:
      o - the object with which to compare
      Returns:
      true if this object is the same as the o argument; false otherwise.
    • schema

      public StructType schema()
      Returns the value of the schema record component.
      Returns:
      the value of the schema record component
    • columns

      public List<ValueVector> columns()
      Returns the value of the columns record component.
      Returns:
      the value of the columns record component
    • index

      public RowIndex index()
      Returns the value of the index record component.
      Returns:
      the value of the index record component