SentenceTransformer

This is a sentence-transformers model. It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

It is a static embedding model; its main purpose is to compute the similarity between Russian and Bashkir sentences.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: unlimited (static embeddings impose no sequence-length limit)
  • Output Dimensionality: 256 dimensions
  • Similarity Function: Cosine Similarity
  • Languages: Bashkir, Russian

Full Model Architecture

SentenceTransformer(
  (0): StaticEmbedding(
    (embedding): EmbeddingBag(120138, 256, mode='mean')
  )
)
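
The StaticEmbedding module contains no transformer layers: each token ID selects one row of a 120,138 × 256 embedding table, and the rows belonging to a sentence are averaged by the EmbeddingBag, which is why encoding is very fast and there is no effective sequence-length limit. Below is a minimal sketch of that pooling step in plain PyTorch; the token IDs are arbitrary placeholders, not output of the model's real tokenizer.

import torch
import torch.nn as nn

# Same shape and pooling mode as the layer shown above.
embedding = nn.EmbeddingBag(num_embeddings=120138, embedding_dim=256, mode="mean")

# Two "sentences" given as one flat list of token IDs plus offsets that mark
# where each sentence starts (IDs are made up for illustration).
token_ids = torch.tensor([17, 532, 9041, 120000, 3, 88])
offsets = torch.tensor([0, 3])  # sentence 1 = ids[0:3], sentence 2 = ids[3:6]

sentence_embeddings = embedding(token_ids, offsets)
print(sentence_embeddings.shape)  # torch.Size([2, 256])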

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("BorisTM/static_rus_bak")
# Run inference
sentences = [
    # Bashkir: "19. What nationalities live in the Republic of Bashkortostan?"
    '19.Башҡортостан Республикаһында ниндәй милләттәр йәшәй?',
    # Russian: "What economic reforms were carried out in Russia in the 19th century?"
    'Какие экономические реформы были проведены в России в XIX веке?',
    # Russian: "Valery Gazzaev also noted that today football is not simply a sport."
    'Валерий Газзаев также отметил, что сегодня футбол не просто является спортом.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 256]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.4605, -0.0718],
#         [ 0.4605,  1.0000, -0.1179],
#         [-0.0718, -0.1179,  1.0000]])
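
Because Bashkir and Russian sentences share the same vector space, the model can also be used for simple cross-lingual retrieval. Here is a small illustrative sketch: the query reuses the Bashkir sentence from the example above, and the Russian candidates are made up for illustration, not taken from the training data.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BorisTM/static_rus_bak")

# Bashkir query: "What nationalities live in the Republic of Bashkortostan?"
query = "Башҡортостан Республикаһында ниндәй милләттәр йәшәй?"

# Russian candidate sentences (illustrative)
candidates = [
    "Какие народы живут в Республике Башкортостан?",
    "Какие экономические реформы были проведены в России в XIX веке?",
    "Сегодня футбол не просто спорт.",
]

query_embedding = model.encode([query])          # shape: (1, 256)
candidate_embeddings = model.encode(candidates)  # shape: (3, 256)

# Cosine similarity between the query and every candidate, shape [1, 3]
scores = model.similarity(query_embedding, candidate_embeddings)
best = int(scores.argmax())
print(candidates[best], float(scores[0, best]))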

Training Details

Training Dataset

Unnamed Dataset

  • Size: 10,246,566 training samples
  • Columns: bak and rus
  • Approximate statistics based on the first 1000 samples:
    • bak: string; min: 2 characters, mean: 84.01 characters, max: 536 characters
    • rus: string; min: 2 characters, mean: 83.26 characters, max: 563 characters
  • Samples (bak | rus):
    • Ref-de Профиль на transfermarkt.de (нем.) | Профиль на transfermarkt.de (нем.)
    • Уның тәүәккәл эш итеүе арҡаһында был әҙәм зарарһыҙландырыла. | Со свойственным ему упрямством этот человек пытается исполнить свою угрозу.
    • Ростов стадионы архитектура үҙенсәлектәре башҡа стадиондарҙыҡынан айырылып торасаҡ. | Нарушение технологического процесса в одном, безусловно, скажется на других этапах.
  • Loss: MultipleNegativesSymmetricRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Evaluation Dataset

rus_bak_real

  • Dataset: rus_bak_real
  • Size: 10,000 evaluation samples
  • Columns: bak and rus
  • Approximate statistics based on the first 1000 samples:
    • bak: string; min: 3 characters, mean: 85.78 characters, max: 1025 characters
    • rus: string; min: 4 characters, mean: 85.84 characters, max: 967 characters
  • Samples (bak | rus):
    • с йылдырымом юсуп а не валерой! | Освежает потрясающе!
    • Беренсе шуныһы ташлана күҙе - блузка уңайлы, туника, салбар. | Первое, что бросается в глаза - чехол, плотный и удобный.
    • Һауаның уртаса йыллыҡ театраһы — 2,3°С, ғин. уртаса температура — -15°С, июлдә — 21°С. Абсолютная максимальная температура — 38°С, абс. миним. театра — -48,0°С. Яуым-төшөмдөң уртаса йыллыҡ миҡдары — 450 мм. | Среднегодовая температура воздуха – 2,3°С, средняя температура янв. – -15°С, июля – 21°С. Абсолютная максимальная температура – 38°С, абс. миним. театра – -48,0°С. Среднегодовое количество осадков – 450 мм.
  • Loss: MultipleNegativesSymmetricRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
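
Both splits are two-column parallel corpora, and the listed loss treats each aligned rus sentence as the positive for its bak counterpart (and vice versa), using the other pairs in the batch as in-batch negatives. The following is a hedged sketch of the expected data format and of instantiating the loss with the parameters listed above; the sentence pair is a toy example, not taken from the actual corpus.

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BorisTM/static_rus_bak")

# Toy two-column parallel dataset in the same "bak" / "rus" format as above.
pairs = Dataset.from_dict({
    "bak": ["Башҡортостан Республикаһында ниндәй милләттәр йәшәй?"],
    "rus": ["Какие народы живут в Республике Башкортостан?"],
})

# scale and similarity_fct mirror the parameters listed above;
# gather_across_devices only matters for multi-GPU runs and is left at its default (False).
loss = losses.MultipleNegativesSymmetricRankingLoss(model, scale=20.0, similarity_fct=cos_sim)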
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 8192
  • per_device_eval_batch_size: 256
  • learning_rate: 0.02
  • weight_decay: 0.01
  • max_steps: 1200
  • warmup_ratio: 0.05
  • bf16: True
  • bf16_full_eval: True
  • remove_unused_columns: False
  • load_best_model_at_end: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 8192
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 0.02
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 1200
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: True
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: False
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.3
  • PyTorch: 2.3.1
  • Accelerate: 1.12.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Model Card Authors

Malashenko Boris

Model Card Contact

[email protected]
