"""Document entity model for representing source files."""

from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DocumentMetadata:
    """Metadata for a source document."""

    file_path: str
    language: Optional[str] = None
    size_bytes: int = 0
    line_count: int = 0
    encoding: str = "utf-8"

    @property
    def extension(self) -> str:
        """Return the file extension without the leading dot ("" if there is none)."""
        return Path(self.file_path).suffix.lstrip(".")


@dataclass
class Document:
    """Represents a source code file loaded for processing."""

    content: str
    metadata: DocumentMetadata
    repo_id: str = ""

    @property
    def file_path(self) -> str:
        """Convenience accessor for file path."""
        return self.metadata.file_path

    @property
    def language(self) -> Optional[str]:
        """Convenience accessor for language."""
        return self.metadata.language

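    @classmethod
    def from_file(cls, file_path: Path, repo_root: Path, repo_id: str = "") -> "Document":
        """Create a Document from a file on disk.

        The stored path is relative to ``repo_root``. Raises ``ValueError`` if
        ``file_path`` is not inside ``repo_root`` and ``UnicodeDecodeError`` if
        the file is not valid UTF-8.
        """
        content = file_path.read_text(encoding="utf-8")
        relative_path = str(file_path.relative_to(repo_root))

        # Count lines without treating a trailing newline as an extra empty
        # line: "a\nb\n" is 2 lines, "a\nb" is also 2, and "" is 0.
        line_count = content.count("\n")
        if content and not content.endswith("\n"):
            line_count += 1

        language = _detect_language(file_path.suffix)

        metadata = DocumentMetadata(
            file_path=relative_path,
            language=language,
            size_bytes=file_path.stat().st_size,
            line_count=line_count,
        )

        return cls(content=content, metadata=metadata, repo_id=repo_id)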


# Maps a lowercase file suffix (including the leading dot) to a language name.
# Defined once at module level so it is not rebuilt on every lookup.
_EXTENSION_MAP = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".jsx": "javascript",
    ".tsx": "typescript",
    ".java": "java",
    ".go": "go",
    ".rs": "rust",
    ".rb": "ruby",
    ".php": "php",
    ".c": "c",
    ".cpp": "cpp",
    ".h": "c",
    ".hpp": "cpp",
    ".cs": "csharp",
    ".swift": "swift",
    ".kt": "kotlin",
    ".scala": "scala",
    ".md": "markdown",
    ".rst": "restructuredtext",
    ".yaml": "yaml",
    ".yml": "yaml",
    ".json": "json",
    ".toml": "toml",
    ".xml": "xml",
    ".html": "html",
    ".css": "css",
    ".sql": "sql",
    ".sh": "bash",
    ".bash": "bash",
    ".zsh": "zsh",
}


def _detect_language(extension: str) -> Optional[str]:
    """Map a file suffix with its leading dot (e.g. ".py") to a language name.

    The lookup is case-insensitive. Returns None for unknown suffixes.
    """
    return _EXTENSION_MAP.get(extension.lower())