Releases: pathwaycom/pathway

v0.13.1

27 Jun 10:31

Added

  • pw.io.kafka.read now accepts an autogenerate_key flag. This flag determines the primary key generation policy to apply when reading raw data from the source. You can either use the key from the Kafka message or have Pathway autogenerate one.
  • pw.io.deltalake.read input connector that fetches changes from DeltaLake into a Pathway table.
  • pw.xpacks.llm.parsers.OpenParse which allows parsing tables and images in PDFs.

Fixed

  • All S3 input connectors (including S3, Min.io, Digital Ocean, and Wasabi) now automatically retry network operations if a failure occurs.
  • The issue where the connection to the S3 source fails after partially ingesting an object has been resolved by downloading the object in full first.
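
The retry behavior described above can be pictured with a generic retry loop. This is only a minimal sketch of the idea, not Pathway's implementation: `with_retries`, `flaky_download`, the attempt count, and the backoff delays are all hypothetical names and values chosen for illustration.

```python
import time


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # exponential backoff


# Hypothetical flaky download: fails twice, then returns the whole object.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return b"full object contents"


# Record the requested delays instead of actually sleeping.
delays = []
result = with_retries(flaky_download, sleep=delays.append)
```

Returning the object only once it has been fetched in full mirrors the second fix: a failure restarts the whole download rather than leaving a partially ingested object.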

v0.13.0

13 Jun 12:12

Added

  • pw.io.deltalake.write now supports S3 destinations.

Changed

  • pw.debug.compute_and_print now allows passing more than one table.
  • BREAKING: path parameter in pw.io.deltalake.write renamed to uri.

Fixed

  • A bug in pw.Table.deduplicate. If persistent_id is not set, it is no longer generated in pw.PersistenceMode.SELECTIVE_PERSISTING mode.

v0.12.0

10 Jun 06:06

Added

  • pw.PyObjectWrapper that enables passing python objects of any type to the engine.
  • cache_strategy option added for pw.io.http.rest_connector. It enables cache configuration, which is useful for duplicated requests.
  • allow_misses argument to Table.ix and Table.ix_ref methods which allows for filling rows with missing keys with None values.
  • pw.io.deltalake.write output connector that streams the changes of a given table into a DeltaLake storage.
  • pw.io.airbyte.read now supports data extraction with Google Cloud Runs.
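
The effect of allow_misses can be illustrated without Pathway. The sketch below models a table as a plain dictionary from keys to rows; `ix_lookup` and the sample data are hypothetical and only mimic the described semantics (missing keys yield rows filled with None instead of an error).

```python
def ix_lookup(table, keys, columns, allow_misses=False):
    """Fetch rows by key; optionally fill rows for missing keys with None."""
    rows = []
    for key in keys:
        if key in table:
            rows.append(table[key])
        elif allow_misses:
            rows.append({column: None for column in columns})
        else:
            raise KeyError(key)
    return rows


prices = {"apple": {"price": 1.5}, "pear": {"price": 2.0}}

# Without allow_misses, a missing key is an error.
strict_failed = False
try:
    ix_lookup(prices, ["apple", "kiwi"], columns=["price"])
except KeyError:
    strict_failed = True

# With allow_misses, the missing key produces a row of None values.
lenient = ix_lookup(prices, ["apple", "kiwi"], columns=["price"], allow_misses=True)
```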

Removed

  • BREAKING: Removed Table.having method.
  • BREAKING: Removed pw.DATE_TIME_UTC, pw.DATE_TIME_NAIVE and pw.DURATION as dtype markers. Instead, pw.DateTimeUtc, pw.DateTimeNaive and pw.Duration should be used, which are wrappers for corresponding pandas types.
  • BREAKING: Removed class transformers from public API: pw.ClassArg, pw.attribute, pw.input_attribute, pw.input_method, pw.method, pw.output_attribute and pw.transformer.
  • BREAKING: Removed several methods from pw.indexing module: binsearch_oracle, filter_cmp_helper, filter_smallest_k and prefix_sum_oracle.

v0.11.2

27 May 08:33

Added

  • pathway.assert_table_has_schema and pathway.table_transformer now accept an allow_subtype argument, which, if True, allows column types in the Table to be subtypes of the types in the Schema.
  • next method to pw.io.python.ConnectorSubject (python connector) that enables passing values of any type to the engine, not only values that are json-serializable. The next method should be the preferred way of passing values from the python connector.

Changed

  • The format argument of pw.io.python.read is deprecated. A data format is inferred from the method used (next_json, next_str, next_bytes) and the provided schema.

Removed

  • Removed pw.numba_apply and numba dependency.

Fixed

  • Fixed pw.this desugaring bug, where __getitem__ in .ix context was not working properly.
  • pw.io.sqlite.read now checks if the data matches the passed schema.

v0.11.1

16 May 19:30

Added

  • query and query_as_of_now of pathway.stdlib.indexing.data_index.DataIndex now accept a column of type str | None in the metadata_column parameter.
  • pathway.xpacks.connectors.sharepoint module under Pathway for Business License.

v0.11.0

10 May 14:56

Added

  • Embedders in the LLM xpack now have a get_embedding_dimension method that returns the number of dimensions used by the chosen embedder.
  • pathway.stdlib.indexing.nearest_neighbors, with implementations of pathway.stdlib.indexing.data_index.InnerIndex based on k-NN via LSH (implemented in Pathway), and k-NN provided by USearch library.
  • pathway.stdlib.indexing.vector_document_index, with a few predefined instances of pathway.stdlib.indexing.data_index.DataIndex.
  • pathway.stdlib.indexing.bm25, with implementations of pathway.stdlib.indexing.data_index.InnerIndex based on BM25 index provided by Tantivy.
  • pathway.stdlib.indexing.full_text_document_index, with a predefined instance of pathway.stdlib.indexing.data_index.DataIndex.
  • Introduced the reranker module under xpacks.llm. It includes a few re-ranking strategies and utility functions for RAG applications.

Changed

  • BREAKING: windowby generates IDs of produced rows differently than in the previous version.
  • BREAKING: pw.io.csv.write prints printable non-ascii characters as regular text, not \u{xxxx}.
  • BREAKING: Connector methods pw.io.elasticsearch.read, pw.io.debezium.read, pw.io.fs.read, pw.io.jsonlines.read, pw.io.kafka.read, pw.io.python.read, pw.io.redpanda.read, pw.io.s3.read now check the type of the input data. Previously, types were not checked when the provided format was "json" or "jsonlines". If the data is inconsistent with the provided schema, the row is skipped and an error message is emitted.
  • BREAKING: query and query_as_of_now methods of pathway.stdlib.indexing.data_index.DataIndex now return pathway.JoinResult, to allow resolving column name conflicts (between columns in the table with queries and table with index data).
  • BREAKING: DataIndex methods query and query_as_of_now now return score in a column named _pw_index_reply_score (defined as _SCORE variable in pathway.stdlib.indexing.colnames.py).
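
The type check on JSON input described above (a row that does not match the schema is skipped and an error is reported) can be sketched in plain Python. This is an illustration of the behavior only, not the connectors' code; `SCHEMA`, `parse_rows`, and the message format are hypothetical.

```python
import json

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"name": str, "age": int}


def parse_rows(lines, schema):
    """Keep rows that match the schema; collect an error for each skipped row."""
    rows, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        record = json.loads(line)
        mismatched = [
            column
            for column, expected in schema.items()
            if not isinstance(record.get(column), expected)
        ]
        if mismatched:
            errors.append(f"line {line_no}: columns {mismatched} do not match the schema")
            continue  # skip the inconsistent row
        rows.append({column: record[column] for column in schema})
    return rows, errors


lines = [
    '{"name": "Ann", "age": 31}',
    '{"name": "Bob", "age": "unknown"}',  # "age" is a string, not an int
]
rows, errors = parse_rows(lines, SCHEMA)
```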

Removed

  • BREAKING: pathway.stdlib.indexing.data_index.VectorDocumentIndex class. Some predefined instances are now meant to be obtained via methods provided in pathway.stdlib.indexing.vector_document_index.
  • BREAKING: with_distances parameter of query and query_as_of_now methods in pathway.stdlib.indexing.data_index.DataIndex. Instead of 'distance', we now operate with a more general term 'score' (higher = better). For distance-based indices, the score is usually defined as the negative distance. The score is now always included in the answer, as long as the underlying index returns something that indicates the quality of a match.
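
The relation between distance and score can be shown with a small sketch. It assumes the convention stated above (score is the negative distance, higher is better); the function name and sample values are hypothetical.

```python
def scores_from_distances(distances):
    """Turn distances (lower = closer) into scores (higher = better)."""
    return {doc: -distance for doc, distance in distances.items()}


# Hypothetical distances from a query to three documents.
distances = {"doc_a": 0.25, "doc_b": 0.5, "doc_c": 1.5}
scores = scores_from_distances(distances)

# Ranking by descending score gives the same order as ascending distance.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Code that previously sorted results by ascending distance would sort by descending score instead.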

v0.10.1

30 Apr 12:25

Added

  • query method to VectorStoreServer to enable compatible API with DataIndex.
  • AdaptiveRAGQuestionAnswerer to xpacks.question_answering. End-to-end pipeline and accompanying code for Private RAG showcase.

v0.10.0

24 Apr 22:21

Added

  • Pathway now warns when unintentionally creating Table with empty universe.
  • pw.io.kafka.write in raw and plaintext formats now supports output for tables with multiple columns. For such tables, you must specify the column to use as the value of the produced Kafka messages, and you can optionally specify a column to use as the key.
  • pw.io.kafka.write can now output values from the table using Kafka message headers in 'raw' and 'plaintext' output format.

Changed

  • instance arguments to groupby, join, with_id_from now determine how entries are distributed between machines.
  • flatten results remain on the same machine as their source entries.
  • join sends each record between machines at most once.
  • BREAKING: flatten, join, groupby (if used with instance), with_id_from (if used with instance) generate IDs of the produced rows differently than in the previous versions.
  • pathway spawn with multiple workers prints only output from the first worker.

v0.9.0

18 Apr 21:01

Added

  • pw.reducers.latest and pw.reducers.earliest, which return the value with the maximal and minimal processing time assigned, respectively.
  • pw.io.kafka.write can now produce messages containing raw bytes in case the table consists of a single binary column and raw mode is specified. Similarly, this method will provide plaintext messages if plaintext mode is chosen and the table consists of a single string-typed column.
  • pw.io.pubsub.write connector for publishing Pathway tables into Google PubSub.
  • Argument strict_prompt to answer_with_geometric_rag_strategy and answer_with_geometric_rag_strategy_from_index that allows optimizing prompts for smaller open-source LLM models.
  • Temporarily switch LiteLLMChat's generation method to sync version due to a bug while using json mode with Ollama.
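
The semantics of the latest and earliest reducers listed above can be sketched in plain Python. The sketch models each entry as a hypothetical (processing_time, value) pair; it illustrates the selection rule only and does not use the Pathway API.

```python
def earliest(entries):
    """Return the value with the minimal processing time."""
    return min(entries, key=lambda entry: entry[0])[1]


def latest(entries):
    """Return the value with the maximal processing time."""
    return max(entries, key=lambda entry: entry[0])[1]


# Hypothetical entries of one group: (processing_time, value).
entries = [(3, "c"), (1, "a"), (2, "b")]
first_value = earliest(entries)
last_value = latest(entries)
```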

Changed

  • BREAKING: pw.io.kafka.read no longer parses messages as UTF-8 when raw mode is specified. To keep the previous behavior (UTF-8 parsing), use the plaintext mode.
  • BREAKING: Table.flatten now flattens one column and spreads every other column of the table, instead of taking other columns from the argument list.
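
The new Table.flatten behavior (one column is flattened, every other column is carried along) can be pictured with rows modeled as dictionaries. This is a plain-Python sketch of the semantics, not the Pathway implementation; the `flatten` helper and sample rows are hypothetical.

```python
def flatten(rows, column):
    """Emit one row per element of `column`, copying all other columns."""
    flattened = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # spread every other column
            new_row[column] = item  # replace the sequence with one element
            flattened.append(new_row)
    return flattened


rows = [
    {"user": "ann", "tags": ["a", "b"]},
    {"user": "bob", "tags": ["c"]},
]
flat = flatten(rows, "tags")
```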

v0.8.6

10 Apr 20:16

Added

  • pw.io.bigquery.write connector for writing Pathway tables into Google BigQuery.
  • parameter filepath_globpattern to query method in VectorStoreClient for specifying which files should be considered in the query.
  • Improved compatibility of pw.Json with standard methods such as len(), int(), float(), bool(), iter(), reversed() when feasible.

Changed

  • pw.io.postgres.write can now parallelize writes to several threads if several workers are configured.
  • Pathway now checks types of pointers rigorously. Indexing a table with a number or types of columns that differ from those used to create the index now results in a TypeError.
  • pw.Json.as_float() method now supports integer JSON values.