Package smile.io

Interface Read


public interface Read
Reads data from external storage systems.
  • Method Details

    • object

      static Object object(Path path) throws IOException, ClassNotFoundException
      Reads a serialized object from a file.
      Parameters:
      path - the file path.
      Returns:
      the deserialized object.
      Throws:
      IOException - if the stream cannot be read.
      ClassNotFoundException - if the class of the serialized object cannot be loaded.
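      The two checked exceptions above are the ones thrown by standard Java object serialization. The sketch below is a plain-JDK round trip, assuming (not confirmed by this page) that the file was written with ObjectOutputStream. The class SerializationRoundTrip and its methods are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of smile.io.
final class SerializationRoundTrip {
    /** Writes a Serializable object to a file with standard Java serialization. */
    static void write(Object obj, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(obj);
        }
    }

    /** Reads the object back; declares the same checked exceptions as object(Path). */
    static Object read(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("model", ".ser");
        write(new ArrayList<>(List.of(1, 2, 3)), file);
        System.out.println(read(file));
        Files.delete(file);
    }
}
```

      The caller has to cast the returned Object to the expected type, which is why a missing class on the classpath surfaces as ClassNotFoundException rather than as a read error.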
    • data

      static DataFrame data(String path) throws IOException, URISyntaxException, ParseException
      Reads a data file. Infers the data format from the file name extension.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      ParseException - if the file cannot be parsed.
      URISyntaxException - if the file path syntax is invalid.
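      Inferring the format from the file name extension can be pictured as a simple dispatch. The sketch below is illustrative only: the class FormatByExtension and the extension-to-reader mapping are assumptions, chosen to match the reader methods listed on this page, and may differ from what the library actually recognizes.

```java
import java.util.Locale;

// Hypothetical helper, not part of smile.io.
final class FormatByExtension {
    /** Returns the lower-cased extension of a file name, or "" if there is none. */
    static String extension(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot <= 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Maps a path to the name of a reader method; the mapping is an assumption. */
    static String reader(String path) {
        switch (extension(path)) {
            case "csv":      return "csv";
            case "json":     return "json";
            case "arff":     return "arff";
            case "sas7bdat": return "sas";
            case "arrow":    return "arrow";
            case "avro":     return "avro";
            case "parquet":  return "parquet";
            default:
                throw new IllegalArgumentException("Unsupported file extension: " + path);
        }
    }

    public static void main(String[] args) {
        System.out.println(reader("data/weather.arff"));
    }
}
```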
    • data

      static DataFrame data(String path, String format) throws IOException, URISyntaxException, ParseException
      Reads a data file. Infers the data format from the file name extension.
      Parameters:
      path - the input file path.
      format - the optional file format specification. For CSV files, a comma-separated list of key-value pairs such as delimiter=\t,header=true,comment=#,escape=\,quote=". For JSON files, the file mode (single-line or multi-line). For Avro files, the path to the schema file.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      ParseException - if the file cannot be parsed.
      URISyntaxException - if the file path syntax is invalid.
    • csv

      static DataFrame csv(String path) throws IOException, URISyntaxException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • csv

      static DataFrame csv(String path, String format) throws IOException, URISyntaxException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      format - the format specification in key-value pairs such as delimiter=\t,header=true,comment=#,escape=\,quote=".
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
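      To make the key-value format specification concrete, the sketch below splits such a string into options using only the JDK. It is not the library's parser: the class FormatSpec is a hypothetical name, values are kept as raw text (so \t stays a backslash followed by t), and a comma cannot be used as a value in this simplified version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of smile.io.
final class FormatSpec {
    /** Splits "key=value,key=value" into an insertion-ordered map of raw strings. */
    static Map<String, String> parse(String spec) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Invalid option: " + pair);
            }
            // Everything after the first '=' is the value, taken verbatim.
            options.put(pair.substring(0, eq).trim(), pair.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        // Java source escapes: "\\t" is the two characters \ and t.
        String spec = "delimiter=\\t,header=true,comment=#,escape=\\,quote=\"";
        parse(spec).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```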
    • csv

      static DataFrame csv(String path, org.apache.commons.csv.CSVFormat format) throws IOException, URISyntaxException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      format - the CSV file format.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • csv

      static DataFrame csv(String path, org.apache.commons.csv.CSVFormat format, StructType schema) throws IOException, URISyntaxException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      format - the CSV file format.
      schema - the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • csv

      static DataFrame csv(Path path) throws IOException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • csv

      static DataFrame csv(Path path, org.apache.commons.csv.CSVFormat format) throws IOException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      format - the CSV file format.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • csv

      static DataFrame csv(Path path, org.apache.commons.csv.CSVFormat format, StructType schema) throws IOException
      Reads a CSV file.
      Parameters:
      path - the input file path.
      format - the CSV file format.
      schema - the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • json

      static DataFrame json(String path) throws IOException, URISyntaxException
      Reads a JSON file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • json

      static DataFrame json(String path, JSON.Mode mode, StructType schema) throws IOException, URISyntaxException
      Reads a JSON file.
      Parameters:
      path - the input file path.
      mode - the file mode (single-line or multi-line).
      schema - the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • json

      static DataFrame json(Path path) throws IOException
      Reads a JSON file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • json

      static DataFrame json(Path path, JSON.Mode mode, StructType schema) throws IOException
      Reads a JSON file.
      Parameters:
      path - the input file path.
      mode - the file mode (single-line or multi-line).
      schema - the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • arff

      static DataFrame arff(String path) throws IOException, ParseException, URISyntaxException
      Reads an ARFF file. Weka ARFF (attribute relation file format) is an ASCII text file format that is essentially a CSV file with a header that describes the metadata. ARFF was developed for use in the Weka machine learning software.

      A dataset is first described, beginning with the name of the dataset (or the relation in ARFF terminology). Each of the variables (or attributes in ARFF terminology) used to describe the observations is then identified, together with its data type, each definition on a single line. The actual observations are then listed, each on a single line, with fields separated by commas, much like a CSV file.

      Missing values in an ARFF dataset are identified using the question mark '?'.

      Comments can be included in the file, introduced at the beginning of a line with a '%', whereby the remainder of the line is ignored.

      A significant advantage of the ARFF data file over the CSV data file is the metadata information.

      Also, the ability to include comments ensures that we can record extra information about the data set, including how it was derived, where it came from, and how it might be cited.

      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      ParseException - if the file cannot be parsed.
      URISyntaxException - if the file path syntax is invalid.
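      The ARFF layout described above (a header of relation and attribute declarations, a data section, '%' comments, and '?' for missing values) can be illustrated with a small hand-written sample. The sketch below uses only the JDK; the sample data and the class ArffSketch are hypothetical, and the code handles just enough of the format to show its structure. It is not the parser behind arff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of smile.io.
final class ArffSketch {
    // A made-up sample: comment, relation, two attributes, two observations.
    static final String SAMPLE = String.join("\n",
        "% Hypothetical weather sample",
        "@relation weather",
        "@attribute outlook {sunny, overcast, rainy}",
        "@attribute temperature numeric",
        "@data",
        "sunny,85",
        "rainy,?");

    /** Returns the attribute names declared in the header. */
    static List<String> attributes(String arff) {
        List<String> names = new ArrayList<>();
        for (String line : arff.split("\n")) {
            String[] tokens = line.trim().split("\\s+", 3);
            if (tokens.length >= 2 && tokens[0].equalsIgnoreCase("@attribute")) {
                names.add(tokens[1]);
            }
        }
        return names;
    }

    /** Returns the observations after @data; a "?" field becomes null. */
    static List<String[]> rows(String arff) {
        List<String[]> rows = new ArrayList<>();
        boolean inData = false;
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // skip comments
            if (!inData) {
                inData = line.equalsIgnoreCase("@data");
                continue;
            }
            String[] fields = line.split(",");
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim();
                if (fields[i].equals("?")) fields[i] = null; // missing value
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(attributes(SAMPLE));
        System.out.println(rows(SAMPLE).size() + " rows");
    }
}
```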
    • arff

      static DataFrame arff(Path path) throws IOException, ParseException
      Reads an ARFF file. Weka ARFF (attribute relation file format) is an ASCII text file format that is essentially a CSV file with a header that describes the metadata. ARFF was developed for use in the Weka machine learning software.

      A dataset is first described, beginning with the name of the dataset (or the relation in ARFF terminology). Each of the variables (or attributes in ARFF terminology) used to describe the observations is then identified, together with its data type, each definition on a single line. The actual observations are then listed, each on a single line, with fields separated by commas, much like a CSV file.

      Missing values in an ARFF dataset are identified using the question mark '?'.

      Comments can be included in the file, introduced at the beginning of a line with a '%', whereby the remainder of the line is ignored.

      A significant advantage of the ARFF data file over the CSV data file is the metadata information.

      Also, the ability to include comments ensures that we can record extra information about the data set, including how it was derived, where it came from, and how it might be cited.

      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      ParseException - if the file cannot be parsed.
    • sas

      static DataFrame sas(String path) throws IOException, URISyntaxException
      Reads a SAS7BDAT file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • sas

      static DataFrame sas(Path path) throws IOException
      Reads a SAS7BDAT file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • arrow

      static DataFrame arrow(String path) throws IOException, URISyntaxException
      Reads an Apache Arrow file. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • arrow

      static DataFrame arrow(Path path) throws IOException
      Reads an Apache Arrow file. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • avro

      static DataFrame avro(String path, InputStream schema) throws IOException, URISyntaxException
      Reads an Apache Avro file.
      Parameters:
      path - the input file path.
      schema - the input stream of the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • avro

      static DataFrame avro(String path, String schema) throws IOException, URISyntaxException
      Reads an Apache Avro file.
      Parameters:
      path - the input file path.
      schema - the data schema file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • avro

      static DataFrame avro(Path path, InputStream schema) throws IOException
      Reads an Apache Avro file.
      Parameters:
      path - the input file path.
      schema - the input stream of the data schema.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • avro

      static DataFrame avro(Path path, Path schema) throws IOException
      Reads an Apache Avro file.
      Parameters:
      path - the input file path.
      schema - the data schema file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • parquet

      static DataFrame parquet(String path) throws IOException, URISyntaxException
      Reads an Apache Parquet file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • parquet

      static DataFrame parquet(Path path) throws IOException
      Reads an Apache Parquet file.
      Parameters:
      path - the input file path.
      Returns:
      the data frame.
      Throws:
      IOException - if the file cannot be read.
    • libsvm

      static SparseDataset<Integer> libsvm(String path) throws IOException, URISyntaxException
      Reads a libsvm sparse dataset. The format of a libsvm file is:
       
       <label> <index1>:<value1> <index2>:<value2> ...
       
      where label is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in the testing data file are only used to calculate accuracy or error. If they are unknown, just fill this column with a number.
      Parameters:
      path - the input file path.
      Returns:
      the sparse dataset.
      Throws:
      IOException - if the file cannot be read.
      URISyntaxException - if the file path syntax is invalid.
    • libsvm

      static SparseDataset<Integer> libsvm(Path path) throws IOException
      Reads a libsvm sparse dataset. The format of libsvm file is:
       
       <label> <index1>:<value1> <index2>:<value2> ...
       
      where label is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in the testing data file are only used to calculate accuracy or error. If they are unknown, just fill this column with a number.
      Parameters:
      path - the input file path.
      Returns:
      the sparse dataset.
      Throws:
      IOException - if the file cannot be read.
    • libsvm

      static SparseDataset<Integer> libsvm(BufferedReader reader) throws IOException
      Reads a libsvm sparse dataset. The format of libsvm file is:
       
       <label> <index1>:<value1> <index2>:<value2> ...
       
      where label is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in the testing data file are only used to calculate accuracy or error. If they are unknown, just fill this column with a number.
      Parameters:
      reader - the file reader.
      Returns:
      the sparse dataset.
      Throws:
      IOException - if the data cannot be read.
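      The libsvm line format described above can be made concrete with a small plain-JDK parser. This is a sketch under stated assumptions, not the code behind libsvm: the class LibsvmLine is a hypothetical name, the label is parsed as an integer (matching the SparseDataset<Integer> return type), and the ascending 1-based index rule is enforced strictly.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of smile.io.
final class LibsvmLine {
    final int label;
    // Sparse features: only the indices present on the line are stored.
    final Map<Integer, Double> features = new TreeMap<>();

    /** Parses one line of the form "<label> <index1>:<value1> <index2>:<value2> ...". */
    LibsvmLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        label = Integer.parseInt(tokens[0]);
        int previous = 0;
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected index:value, got " + tokens[i]);
            }
            int index = Integer.parseInt(pair[0]);
            if (index <= previous) {
                throw new IllegalArgumentException(
                    "Indices must start at 1 and be ascending: " + tokens[i]);
            }
            features.put(index, Double.parseDouble(pair[1]));
            previous = index;
        }
    }

    public static void main(String[] args) {
        LibsvmLine row = new LibsvmLine("1 3:0.5 7:2");
        System.out.println(row.label + " " + row.features);
    }
}
```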