SentenceTransformer
This is a sentence-transformers model. It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
It is a static embedding model. Its main purpose is to compute similarity between Russian and Bashkir sentences.
Model Details
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: inf tokens
- Output Dimensionality: 256 dimensions
- Similarity Function: Cosine Similarity
- Language: Bashkir
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): StaticEmbedding(
    (embedding): EmbeddingBag(120138, 256, mode='mean')
  )
)
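To make the architecture above concrete, here is a minimal pure-Python sketch of what a mean-mode embedding bag computes: each token id selects a row of a lookup table, and the sentence vector is the average of those rows. The 4-row, 3-dimensional `TOY_TABLE` and the helper `mean_pool` are hypothetical names used only for illustration; the real model stores 120138 rows of 256 dimensions and relies on its own tokenizer.

```python
# Toy stand-in for EmbeddingBag(num_embeddings, dim, mode='mean').
# Hypothetical 4-token vocabulary with 3-dimensional vectors.
TOY_TABLE = [
    [1.0, 0.0, 2.0],  # token id 0
    [3.0, 4.0, 0.0],  # token id 1
    [5.0, 2.0, 4.0],  # token id 2
    [0.0, 0.0, 0.0],  # token id 3
]


def mean_pool(token_ids, table):
    """Average the table rows selected by token_ids into one sentence vector."""
    if not token_ids:
        raise ValueError("need at least one token id")
    dim = len(table[0])
    totals = [0.0] * dim
    for token_id in token_ids:
        for i, value in enumerate(table[token_id]):
            totals[i] += value
    return [total / len(token_ids) for total in totals]


sentence_vector = mean_pool([0, 1, 2], TOY_TABLE)
print(sentence_vector)  # [3.0, 2.0, 2.0]
```

Because the pooling is a plain average, the result does not depend on token order, and nothing in the computation caps the number of tokens, which is consistent with the unbounded sequence length reported above.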
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("BorisTM/static_rus_bak")

# Run inference
sentences = [
    '19.Башҡортостан Республикаһында ниндәй милләттәр йәшәй?',
    'Какие экономические реформы были проведены в России в XIX веке?',
    'Валерий Газзаев также отметил, что сегодня футбол не просто является спортом.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 256)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.4605, -0.0718],
#         [ 0.4605,  1.0000, -0.1179],
#         [-0.0718, -0.1179,  1.0000]])
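The similarity scores above are pairwise cosine similarities, as listed under Similarity Function. The following sketch reproduces that computation in plain Python on hypothetical 2-dimensional vectors, so it runs without downloading the model. The helpers `cosine_similarity` and `similarity_matrix` are illustrative names, not part of the Sentence Transformers API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Design choice of this sketch: fail loudly instead of returning NaN.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """All-pairs cosine similarity; the result is symmetric with ones on the diagonal."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy embeddings standing in for the 256-dimensional model output.
toy_vectors = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(value, 4) for value in row])
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.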
Training Details
Training Dataset
Unnamed Dataset
- Size: 10,246,566 training samples
- Columns: bak and rus
- Approximate statistics based on the first 1000 samples:
  - bak (string): min 2 characters, mean 84.01 characters, max 536 characters
  - rus (string): min 2 characters, mean 83.26 characters, max 563 characters
- Samples:
  - bak: Ref-de Профиль на transfermarkt.de (нем.)
    rus: Профиль на transfermarkt.de (нем.)
  - bak: Уның тәүәккәл эш итеүе арҡаһында был әҙәм зарарһыҙландырыла.
    rus: Со свойственным ему упрямством этот человек пытается исполнить свою угрозу.
  - bak: Ростов стадионы архитектура үҙенсәлектәре башҡа стадиондарҙыҡынан айырылып торасаҡ.
    rus: Нарушение технологического процесса в одном, безусловно, скажется на других этапах.
- Loss: MultipleNegativesSymmetricRankingLoss with these parameters:
  { "scale": 20.0, "similarity_fct": "cos_sim", "gather_across_devices": false }
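As a rough illustration of the loss named above, the sketch below computes a symmetric in-batch ranking loss in plain Python: cosine similarities between every bak and rus vector in a batch are multiplied by the scale (20.0 here), and cross-entropy is taken in both directions with the matching pair as the correct class. This is a simplified reading of the idea, not the library's implementation; the function names and the 2-dimensional toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cross_entropy_diagonal(scores):
    """Mean cross-entropy where the correct column for row i is column i."""
    losses = []
    for i, row in enumerate(scores):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in row))
        losses.append(log_sum_exp - row[i])
    return sum(losses) / len(losses)


def symmetric_ranking_loss(bak_vecs, rus_vecs, scale=20.0):
    """Average of the bak-to-rus and rus-to-bak in-batch ranking losses."""
    scores = [[scale * cosine(b, r) for r in rus_vecs] for b in bak_vecs]
    transposed = [list(column) for column in zip(*scores)]
    return (cross_entropy_diagonal(scores) + cross_entropy_diagonal(transposed)) / 2


# Hypothetical batch of two pairs: aligned translations vs. swapped translations.
bak_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_ranking_loss(bak_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_ranking_loss(bak_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)  # True
```

With a batch of 8192 pairs, every other sentence in the batch acts as a negative, so each pair is ranked against thousands of alternatives in both directions.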
Evaluation Dataset
rus_bak_real
- Dataset: rus_bak_real
- Size: 10,000 evaluation samples
- Columns: bak and rus
- Approximate statistics based on the first 1000 samples:
  - bak (string): min 3 characters, mean 85.78 characters, max 1025 characters
  - rus (string): min 4 characters, mean 85.84 characters, max 967 characters
- Samples:
  - bak: с йылдырымом юсуп а не валерой!
    rus: Освежает потрясающе!
  - bak: Беренсе шуныһы ташлана күҙе - блузка уңайлы, туника, салбар.
    rus: Первое, что бросается в глаза - чехол, плотный и удобный.
  - bak: Һауаның уртаса йыллыҡ театраһы — 2,3°С, ғин. уртаса температура — -15°С, июлдә — 21°С. Абсолютная максимальная температура — 38°С, абс. миним. театра — -48,0°С. Яуым-төшөмдөң уртаса йыллыҡ миҡдары — 450 мм.
    rus: Среднегодовая температура воздуха – 2,3°С, средняя температура янв. – -15°С, июля – 21°С. Абсолютная максимальная температура – 38°С, абс. миним. театра – -48,0°С. Среднегодовое количество осадков – 450 мм.
- Loss: MultipleNegativesSymmetricRankingLoss with these parameters:
  { "scale": 20.0, "similarity_fct": "cos_sim", "gather_across_devices": false }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 8192
- per_device_eval_batch_size: 256
- learning_rate: 0.02
- weight_decay: 0.01
- max_steps: 1200
- warmup_ratio: 0.05
- bf16: True
- bf16_full_eval: True
- remove_unused_columns: False
- load_best_model_at_end: True
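To show how these values interact, the sketch below works through the learning-rate schedule and the amount of data seen, using only numbers reported in this card. It assumes the usual linear schedule (linear warmup from zero, then linear decay to zero), a warmup length equal to the ratio times the step count rounded up, and training on a single device; none of these assumptions is confirmed by the card beyond the linear scheduler type listed among all hyperparameters.

```python
import math

LEARNING_RATE = 0.02        # learning_rate
MAX_STEPS = 1200            # max_steps
WARMUP_RATIO = 0.05         # warmup_ratio
BATCH_SIZE = 8192           # per_device_train_batch_size
TRAIN_SAMPLES = 10_246_566  # training set size reported in this card

# Assumption: warmup length is the ratio times the total step count, rounded up.
WARMUP_STEPS = math.ceil(MAX_STEPS * WARMUP_RATIO)


def linear_lr(step):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)


# Assumption: one device, so each step consumes exactly one batch.
samples_seen = MAX_STEPS * BATCH_SIZE
epochs_seen = samples_seen / TRAIN_SAMPLES

print(WARMUP_STEPS)                 # 60
print(samples_seen)                 # 9830400
print(round(epochs_seen, 2))        # 0.96
print(linear_lr(WARMUP_STEPS))      # 0.02
```

Under these assumptions, training covers slightly less than one pass over the 10,246,566 pairs, so max_steps rather than num_train_epochs determines when training stops.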
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8192
- per_device_eval_batch_size: 256
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 0.02
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3.0
- max_steps: 1200
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.05
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: True
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: False
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- project: huggingface
- trackio_space_id: trackio
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: no
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: True
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 5.1.2
- Transformers: 4.57.3
- PyTorch: 2.3.1
- Accelerate: 1.12.0
- Datasets: 4.4.1
- Tokenizers: 0.22.1
Model Card Authors
Malashenko Boris